EViT: An Eagle Vision Transformer with Bi-Fovea Self-Attention (2310.06629v4)

Published 10 Oct 2023 in cs.CV

Abstract: Owing to advancements in deep learning technology, Vision Transformers (ViTs) have demonstrated impressive performance in various computer vision tasks. Nonetheless, ViTs still face some challenges, such as high computational complexity and the absence of desirable inductive biases. To alleviate these issues, the potential advantages of combining eagle vision with ViTs are explored. We summarize a Bi-Fovea Visual Interaction (BFVI) structure inspired by the unique physiological and visual characteristics of eagle eyes. A novel Bi-Fovea Self-Attention (BFSA) mechanism and Bi-Fovea Feedforward Network (BFFN) are proposed based on this structural design approach, which can be used to mimic the hierarchical and parallel information processing scheme of the biological visual cortex, enabling networks to learn feature representations of targets in a coarse-to-fine manner. Furthermore, a Bionic Eagle Vision (BEV) block is designed as the basic building unit based on the BFSA mechanism and BFFN. By stacking BEV blocks, a unified and efficient family of pyramid backbone networks called Eagle Vision Transformers (EViTs) is developed. Experimental results show that EViTs exhibit highly competitive performance in various computer vision tasks, such as image classification, object detection and semantic segmentation. Compared with other approaches, EViTs have significant advantages, especially in terms of performance and computational efficiency. Code is available at https://github.com/nkusyl/EViT
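
The abstract outlines the architecture at a high level: a Bi-Fovea Self-Attention (BFSA) mechanism and a Bi-Fovea Feedforward Network (BFFN) are combined into a Bionic Eagle Vision (BEV) block, and BEV blocks are stacked into a pyramid backbone. The sketch below is only a minimal PyTorch illustration of that block structure, not the authors' implementation; the coarse/fine two-branch split, the channel partition, and every layer choice are assumptions made for clarity (the official code at https://github.com/nkusyl/EViT gives the actual design).

```python
# Hypothetical sketch of a bi-fovea style block (illustration only; layer
# choices are assumptions, not the paper's exact BFSA/BFFN definitions).
import torch
import torch.nn as nn


class BiFoveaSelfAttention(nn.Module):
    """Coarse-to-fine token mixing with two parallel branches (assumed layout)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        half = dim // 2
        # Coarse branch: cheap depthwise convolution for local features.
        self.coarse = nn.Conv2d(half, half, kernel_size=3, padding=1, groups=half)
        # Fine branch: multi-head self-attention for global features.
        self.fine = nn.MultiheadAttention(half, num_heads, batch_first=True)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b, c, h, w = x.shape
        xc, xf = x.chunk(2, dim=1)                 # split channels across branches
        coarse = self.coarse(xc)                   # (B, C/2, H, W)
        tokens = xf.flatten(2).transpose(1, 2)     # (B, H*W, C/2) for attention
        fine, _ = self.fine(tokens, tokens, tokens)
        fine = fine.transpose(1, 2).reshape(b, c // 2, h, w)
        return self.proj(torch.cat([coarse, fine], dim=1))


class BEVBlock(nn.Module):
    """Attention followed by a feedforward network, each with a residual path."""

    def __init__(self, dim: int):
        super().__init__()
        self.bfsa = BiFoveaSelfAttention(dim)
        self.ffn = nn.Sequential(                  # stand-in for the paper's BFFN
            nn.Conv2d(dim, dim * 4, 1), nn.GELU(), nn.Conv2d(dim * 4, dim, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.bfsa(x)
        return x + self.ffn(x)


if __name__ == "__main__":
    block = BEVBlock(dim=64)
    out = block(torch.randn(1, 64, 14, 14))
    print(out.shape)  # torch.Size([1, 64, 14, 14])
```

Stacking several such blocks per stage, with downsampling between stages, would yield the pyramid layout the abstract refers to; the stage depths and widths used by the EViT family are given in the paper and repository, not here.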

Authors (5)
  1. Yulong Shi (5 papers)
  2. Mingwei Sun (10 papers)
  3. Yongshuai Wang (2 papers)
  4. Jiahao Ma (11 papers)
  5. Zengqiang Chen (6 papers)
Citations (1)