BRAU-Net++: U-Shaped Hybrid CNN-Transformer Network for Medical Image Segmentation (2401.00722v2)
Abstract: Accurate medical image segmentation is essential for clinical quantification, disease diagnosis, treatment planning, and many other applications. Both convolution-based and transformer-based u-shaped architectures have achieved significant success in various medical image segmentation tasks. The former can efficiently learn local information of images, benefiting from the image-specific inductive biases inherent to the convolution operation. The latter can effectively capture long-range dependencies at different feature scales using self-attention, but typically incurs quadratic compute and memory costs as sequence length increases. To address this problem, we integrate the merits of these two paradigms into a well-designed u-shaped architecture and propose a hybrid yet effective CNN-Transformer network, named BRAU-Net++, for accurate medical image segmentation. Specifically, BRAU-Net++ uses bi-level routing attention as the core building block of its u-shaped encoder-decoder structure, in which both encoder and decoder are hierarchically constructed, so as to learn global semantic information while reducing computational complexity. Furthermore, the network redesigns the skip connections by incorporating channel-spatial attention, realized with convolution operations, to minimize the loss of local spatial information and amplify the global interaction of multi-scale features across dimensions. Extensive experiments on three public benchmark datasets demonstrate that the proposed approach surpasses other state-of-the-art methods, including its baseline BRAU-Net, under almost all evaluation metrics. It achieves an average Dice-Similarity Coefficient (DSC) of 82.47, 90.10, and 92.94 on Synapse multi-organ segmentation, the ISIC-2018 Challenge, and CVC-ClinicDB, respectively, as well as an mIoU of 84.01 and 88.17 on the ISIC-2018 Challenge and CVC-ClinicDB.
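To make the core building block concrete, here is a minimal PyTorch sketch of bi-level routing attention in the spirit of BiFormer, which BRAU-Net++ builds on. This is an illustrative sketch, not the authors' implementation: the single-head simplification and the names `num_regions` and `top_k` are assumptions. Tokens are grouped into S×S regions, a coarse region-to-region affinity selects the top-k key/value regions for each query region, and ordinary token-level attention then runs only over the gathered tokens.

```python
import torch
import torch.nn.functional as F
from torch import nn

class BiLevelRoutingAttention(nn.Module):
    """Single-head sketch of bi-level routing attention (after BiFormer).

    Tokens on an (H, W) grid are split into S x S non-overlapping regions.
    A coarse region-to-region affinity graph routes each query region to
    its top-k most relevant key/value regions; fine-grained token attention
    is then computed only over the gathered tokens.
    """

    def __init__(self, dim, num_regions=7, top_k=4):
        super().__init__()
        self.num_regions = num_regions  # S: regions per spatial side
        self.top_k = top_k              # k: routed regions per query region
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C) with H and W divisible by num_regions
        B, H, W, C = x.shape
        S = self.num_regions
        h, w = H // S, W // S                    # tokens per region side
        q, k, v = self.qkv(x).chunk(3, dim=-1)   # each (B, H, W, C)

        # Reshape into (B, S*S, h*w, C): regions x tokens-per-region.
        def to_regions(t):
            t = t.reshape(B, S, h, S, w, C)
            return t.permute(0, 1, 3, 2, 4, 5).reshape(B, S * S, h * w, C)

        q_r, k_r, v_r = map(to_regions, (q, k, v))

        # Level 1: coarse routing. Region descriptors are mean-pooled
        # tokens; affinity is their dot product.
        q_mean = q_r.mean(dim=2)                              # (B, S*S, C)
        k_mean = k_r.mean(dim=2)                              # (B, S*S, C)
        affinity = q_mean @ k_mean.transpose(-1, -2)          # (B, S*S, S*S)
        topk_idx = affinity.topk(self.top_k, dim=-1).indices  # (B, S*S, k)

        # Gather key/value tokens from the k routed regions.
        idx = topk_idx[..., None, None].expand(-1, -1, -1, h * w, C)
        k_g = torch.gather(k_r[:, None].expand(-1, S * S, -1, -1, -1), 2, idx)
        v_g = torch.gather(v_r[:, None].expand(-1, S * S, -1, -1, -1), 2, idx)
        k_g = k_g.reshape(B, S * S, self.top_k * h * w, C)
        v_g = v_g.reshape(B, S * S, self.top_k * h * w, C)

        # Level 2: fine token-to-token attention within routed regions.
        attn = (q_r @ k_g.transpose(-1, -2)) * self.scale
        out = attn.softmax(dim=-1) @ v_g                      # (B, S*S, h*w, C)

        # Restore the (B, H, W, C) layout and project.
        out = out.view(B, S, S, h, w, C).permute(0, 1, 3, 2, 4, 5)
        return self.proj(out.reshape(B, H, W, C))
```

With this routing, the token-level attention cost in the sketch scales with HW · k · hw rather than (HW)², which is the complexity reduction the abstract alludes to. For instance, on a 56×56 first-stage feature map, `num_regions=7` yields 8×8-token regions (an illustrative choice mirroring BiFormer's typical S=7).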
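The restructured skip connection can be approximated in the same spirit. The sketch below is a generic CBAM-style channel-plus-spatial attention gate that fuses an encoder skip feature with the upsampled decoder feature; it is an assumption about the general idea the abstract describes (convolutional channel-spatial attention on skip connections), not the paper's exact module, and it reuses the imports from the sketch above.

```python
class ChannelSpatialGate(nn.Module):
    """CBAM-style channel + spatial attention, used here as a stand-in
    for the convolution-based skip-connection attention described in
    the abstract (fusing encoder and decoder features at one scale)."""

    def __init__(self, dim, reduction=4):
        super().__init__()
        # Channel attention: squeeze spatial dims, excite channels.
        self.channel_mlp = nn.Sequential(
            nn.Conv2d(2 * dim, 2 * dim // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * dim // reduction, 2 * dim, kernel_size=1),
        )
        # Spatial attention: 7x7 conv over pooled channel maps.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, enc, dec):
        # enc, dec: (B, C, H, W) encoder skip and upsampled decoder maps
        x = torch.cat([enc, dec], dim=1)                  # (B, 2C, H, W)
        avg = F.adaptive_avg_pool2d(x, 1)                 # (B, 2C, 1, 1)
        mx = F.adaptive_max_pool2d(x, 1)
        x = x * torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx))
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.amax(1, keepdim=True)], dim=1)  # (B, 2, H, W)
        x = x * torch.sigmoid(self.spatial_conv(pooled))
        return self.fuse(x)                               # back to (B, C, H, W)
```

Gating before the 1×1 fusion lets the decoder suppress irrelevant encoder channels and spatial positions before the multi-scale features are merged.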
Authors: Libin Lan, Pengzhou Cai, Lu Jiang, Xiaojuan Liu, Yongmei Li, Yudong Zhang