3D TransUNet: Advancing Medical Image Segmentation through Vision Transformers (2310.07781v1)

Published 11 Oct 2023 in cs.CV

Abstract: Medical image segmentation plays a crucial role in advancing healthcare systems for disease diagnosis and treatment planning. The u-shaped architecture, popularly known as U-Net, has proven highly successful for various medical image segmentation tasks. However, U-Net's convolution-based operations inherently limit its ability to model long-range dependencies effectively. To address these limitations, researchers have turned to Transformers, renowned for their global self-attention mechanisms, as alternative architectures. One popular network is our previous TransUNet, which leverages Transformers' self-attention to complement U-Net's localized information with the global context. In this paper, we extend the 2D TransUNet architecture to a 3D network by building upon the state-of-the-art nnU-Net architecture, and fully exploring Transformers' potential in both the encoder and decoder design. We introduce two key components: 1) A Transformer encoder that tokenizes image patches from a convolution neural network (CNN) feature map, enabling the extraction of global contexts, and 2) A Transformer decoder that adaptively refines candidate regions by utilizing cross-attention between candidate proposals and U-Net features. Our investigations reveal that different medical tasks benefit from distinct architectural designs. The Transformer encoder excels in multi-organ segmentation, where the relationship among organs is crucial. On the other hand, the Transformer decoder proves more beneficial for dealing with small and challenging segmented targets such as tumor segmentation. Extensive experiments showcase the significant potential of integrating a Transformer-based encoder and decoder into the u-shaped medical image segmentation architecture. TransUNet outperforms competitors in various medical applications.

An Analysis of "3D TransUNet: Advancing Medical Image Segmentation through Vision Transformers"

The paper "3D TransUNet: Advancing Medical Image Segmentation through Vision Transformers" proposes a novel approach to medical image segmentation that leverages the potent capabilities of Vision Transformers (ViTs), particularly in a 3D context. Building upon the foundational U-Net architecture, the authors introduce the 3D TransUNet, which integrates Transformers into both encoder and decoder components to overcome the inherent limitations of convolutional neural networks (CNNs) in modeling global contexts and long-range dependencies.

The paper identifies and addresses a critical limitation of conventional U-Nets: while they excel at local feature extraction, their reliance on convolutional operations restricts their ability to model long-range dependencies, which are essential for medical image segmentation tasks characterized by significant variation in texture, shape, and size. By employing a Transformer-based architecture, known for its global self-attention mechanisms, the 3D TransUNet offers a promising alternative.

Integration of Transformers

The 3D TransUNet framework incorporates Transformers in two primary architectural components to enhance segmentation accuracy:

  1. Transformer Encoder: This component tokenizes image patches derived from CNN feature maps, allowing for a seamless fusion of global self-attentive features with high-resolution CNN features. This integration preserves precise localization while modeling global dependencies effectively (see the tokenization sketch after this list).
  2. Transformer Decoder: By recasting segmentation as a mask classification problem, the Transformer decoder uses learnable queries that are refined through cross-attention with localized CNN features. This hybrid approach leverages the strengths of both CNNs and Transformers, delivering improved segmentation results.
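
To make the encoder path concrete, here is a minimal sketch, assuming PyTorch, of how a 3D CNN feature map can be flattened into tokens and passed through a standard Transformer encoder. The module name, tensor sizes, and layer counts are illustrative assumptions, not the authors' released implementation:

```python
import torch
import torch.nn as nn

class PatchTokenizer3D(nn.Module):
    """Flatten a 3D CNN feature map into a token sequence for a Transformer
    encoder. A hypothetical helper; positional embeddings are omitted for
    brevity."""
    def __init__(self, in_channels: int, embed_dim: int):
        super().__init__()
        # A 1x1x1 convolution projects CNN channels to the Transformer width.
        self.proj = nn.Conv3d(in_channels, embed_dim, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, D, H, W) -> tokens: (B, D*H*W, embed_dim)
        x = self.proj(feat)
        return x.flatten(2).transpose(1, 2)

# Global self-attention over the tokenized feature map.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=4,
)
feat = torch.randn(1, 320, 8, 8, 8)  # e.g. a deep 3D feature map
tokens = PatchTokenizer3D(in_channels=320, embed_dim=256)(feat)
context = encoder(tokens)            # (1, 512, 256) globally attended tokens
```

The key point is that self-attention operates over every spatial position of the deep feature map at once, which is what gives the encoder a global receptive field that convolutions alone lack.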

Notably, the paper introduces a coarse-to-fine attention mechanism within the Transformer decoder to enhance segmentation accuracy iteratively. By focusing on the foreground during cross-attention, this mechanism progressively refines the segmentation output, proving particularly effective in tasks involving small and challenging targets such as tumor segmentation.
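
The paper does not provide pseudocode for this mechanism, but the idea of restricting cross-attention to the predicted foreground can be sketched as follows. This is a hedged approximation in PyTorch; the thresholding rule, shapes, and function name are assumptions rather than the authors' exact formulation:

```python
import torch

def foreground_masked_cross_attention(queries, keys, values,
                                      prev_mask_logits, threshold=0.5):
    """Coarse-to-fine cross-attention sketch: each query attends only to
    positions the previous round predicted as foreground.
    queries: (B, Q, C); keys/values: (B, N, C); prev_mask_logits: (B, Q, N)."""
    scale = queries.shape[-1] ** -0.5
    attn = torch.einsum("bqc,bnc->bqn", queries, keys) * scale
    # Suppress background: where the previous mask falls below the threshold,
    # set the attention logit to -inf before the softmax.
    background = prev_mask_logits.sigmoid() < threshold
    attn = attn.masked_fill(background, float("-inf"))
    attn = attn.softmax(dim=-1)
    # Guard against queries whose entire mask is background (all -inf rows).
    attn = torch.nan_to_num(attn, nan=0.0)
    return torch.einsum("bqn,bnc->bqc", attn, values)
```

Each decoding round uses the previous round's mask to narrow where the queries look, so attention progressively concentrates on the target region rather than diffusing over the whole volume.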

Empirical Evaluation and Results

The paper conducts extensive experiments across multiple medical image segmentation tasks, including multi-organ segmentation and tumor/lesion segmentation. The results demonstrate the superior performance of 3D TransUNet over existing models, including baseline U-Nets and other Transformer-based architectures such as nnFormer and Swin UNETR.

The paper provides a nuanced comparison between different configurations of 3D TransUNet: Encoder-only, Decoder-only, and combined Encoder+Decoder. The findings indicate that the Encoder-only configuration shows marked improvements in multi-organ segmentation tasks due to its capacity to capture global organ relationships. Conversely, the Decoder-only configuration excels at segmenting small targets, attributable to its iterative, foreground-focused refinement of candidate regions.

Implications and Future Directions

The integration of Transformers into medical image segmentation holds significant promise for advancing the field, addressing the key challenge posed by CNNs' limited ability to capture long-range dependencies. By showing how Transformer encoders and decoders each benefit different task types, the paper underscores the potential of hybrid architectures in medical applications.

As Transformer models continue to improve, especially in computational efficiency and scalability, such hybrid designs are likely to see wider adoption in medical imaging. Future variants of 3D TransUNet might explore even more adaptive architectures, potentially integrating few-shot learning paradigms to better handle diverse and rare medical imaging conditions.

In conclusion, the paper effectively demonstrates how combining Transformer architectures with traditional CNNs can provide more versatile and accurate solutions for complex medical segmentation tasks. The introduction of 3D TransUNet marks a substantive step in the expanding applications of Transformers in medical imaging, offering a promising path for future research and innovation.

References (45)
  1. J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
  2. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention.   Springer, 2015, pp. 234–241.
  3. L. Yu, J.-Z. Cheng, Q. Dou, X. Yang, H. Chen, J. Qin, and P.-A. Heng, “Automatic 3d cardiovascular mr segmentation with densely-connected volumetric convnets,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2017, pp. 287–295.
  4. Y. Zhou, L. Xie, W. Shen, Y. Wang, E. K. Fishman, and A. L. Yuille, “A fixed-point model for pancreas segmentation in abdominal ct scans,” in International conference on medical image computing and computer-assisted intervention.   Springer, 2017, pp. 693–701.
  5. X. Li, H. Chen, X. Qi, Q. Dou, C.-W. Fu, and P.-A. Heng, “H-denseunet: hybrid densely connected unet for liver and tumor segmentation from ct volumes,” IEEE transactions on medical imaging, vol. 37, no. 12, pp. 2663–2674, 2018.
  6. Q. Yu, L. Xie, Y. Wang, Y. Zhou, E. K. Fishman, and A. L. Yuille, “Recurrent saliency transformation network: Incorporating multi-stage visual cues for small organ segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8280–8289.
  7. X. Luo, J. Chen, T. Song, Y. Chen, G. Wang, and S. Zhang, “Semi-supervised medical image segmentation through dual-task consistency,” AAAI Conference on Artificial Intelligence, 2021.
  8. Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: A nested u-net architecture for medical image segmentation,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support.   Springer, 2018, pp. 3–11.
  9. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
  10. J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille, and Y. Zhou, “Transunet: Transformers make strong encoders for medical image segmentation,” arXiv preprint arXiv:2102.04306, 2021.
  11. X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7794–7803.
  12. J. Schlemper, O. Oktay, M. Schaap, M. Heinrich, B. Kainz, B. Glocker, and D. Rueckert, “Attention gated networks: Learning to leverage salient regions in medical images,” Medical image analysis, vol. 53, pp. 197–207, 2019.
  13. N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, and D. Tran, “Image transformer,” in International Conference on Machine Learning.   PMLR, 2018, pp. 4055–4064.
  14. R. Child, S. Gray, A. Radford, and I. Sutskever, “Generating long sequences with sparse transformers,” arXiv preprint arXiv:1904.10509, 2019.
  15. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in ICLR, 2021.
  16. H.-Y. Zhou, J. Guo, Y. Zhang, X. Han, L. Yu, L. Wang, and Y. Yu, “nnformer: Volumetric medical image segmentation via a 3d transformer,” IEEE Transactions on Image Processing, 2023.
  17. Y. Xie, J. Zhang, C. Shen, and Y. Xia, “Cotr: Efficiently bridging cnn and transformer for 3d medical image segmentation,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part III 24.   Springer, 2021, pp. 171–180.
  18. Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10 012–10 022.
  19. H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and M. Wang, “Swin-unet: Unet-like pure transformer for medical image segmentation,” in European conference on computer vision.   Springer, 2022, pp. 205–218.
  20. A. Hatamizadeh, V. Nath, Y. Tang, D. Yang, H. R. Roth, and D. Xu, “Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images,” in International MICCAI Brainlesion Workshop.   Springer, 2021, pp. 272–284.
  21. N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European conference on computer vision.   Springer, 2020, pp. 213–229.
  22. R. Strudel, R. Garcia, I. Laptev, and C. Schmid, “Segmenter: Transformer for semantic segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 7262–7272.
  23. H. Wang, Y. Zhu, H. Adam, A. Yuille, and L.-C. Chen, “Max-deeplab: End-to-end panoptic segmentation with mask transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5463–5474.
  24. B. Cheng, A. Schwing, and A. Kirillov, “Per-pixel classification is not all you need for semantic segmentation,” Advances in Neural Information Processing Systems, vol. 34, 2021.
  25. B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 1290–1299.
  26. Q. Yu, H. Wang, S. Qiao, M. Collins, Y. Zhu, H. Adam, A. Yuille, and L.-C. Chen, “k-means mask transformer,” in European Conference on Computer Vision.   Springer, 2022, pp. 288–307.
  27. Q. Yu, H. Wang, D. Kim, S. Qiao, M. Collins, Y. Zhu, H. Adam, A. Yuille, and L.-C. Chen, “Cmt-deeplab: Clustering mask transformers for panoptic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2560–2570.
  28. Z. Zhu, Y. Xia, W. Shen, E. Fishman, and A. Yuille, “A 3d coarse-to-fine framework for volumetric medical image segmentation,” in 2018 International conference on 3D vision (3DV).   IEEE, 2018, pp. 682–690.
  29. L. Xie, Q. Yu, Y. Zhou, Y. Wang, E. K. Fishman, and A. L. Yuille, “Recurrent saliency transformation network for tiny target segmentation in abdominal ct scans,” IEEE transactions on medical imaging, vol. 39, no. 2, pp. 514–525, 2019.
  30. F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 3DV, 2016.
  31. S. Fu, Y. Lu, Y. Wang, Y. Zhou, W. Shen, E. Fishman, and A. Yuille, “Domain adaptive relational reasoning for 3d multi-organ segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2020, pp. 656–666.
  32. H. M. Luu and S.-H. Park, “Extending nn-unet for brain tumor segmentation,” arXiv preprint arXiv:2112.04653, 2021.
  33. M. Antonelli, A. Reinke, S. Bakas, K. Farahani, B. A. Landman, G. Litjens, B. Menze, O. Ronneberger, R. M. Summers, B. van Ginneken et al., “The medical segmentation decathlon,” arXiv preprint arXiv:2106.05735, 2021.
  34. Z. Zhu, Y. Xia, L. Xie, E. K. Fishman, and A. L. Yuille, “Multi-scale coarse-to-fine segmentation for screening pancreatic ductal adenocarcinoma,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part VI 22.   Springer, 2019, pp. 3–12.
  35. A. W. Moawad, A. Janas, U. Baid, D. Ramakrishnan, L. Jekel, K. Krantchev, H. Moy, R. Saluja, K. Osenberg, K. Wilms et al., “The brain tumor segmentation (brats-mets) challenge 2023: Brain metastasis segmentation on pre-treatment mri,” arXiv preprint arXiv:2306.00838, 2023.
  36. E. Oermann, K. Link, Z. Schnurman, C. Liu, Y. J. F. Kwon, L. Y. Jiang, M. Nasir-Moin, S. Neifert, J. Alzate, K. Bernstein et al., “Longitudinal deep neural networks for assessing metastatic brain cancer on a massive open benchmark.” preprint, 2023.
  37. J. D. Rudie, R. S. D. A. Weiss, P. Nedelec, E. Calabrese, J. B. Colby, B. Laguna, J. Mongan, S. Braunstein, C. P. Hess, A. M. Rauschecker et al., “The university of california san francisco, brain metastases stereotactic radiosurgery (ucsf-bmsr) mri dataset,” arXiv preprint arXiv:2304.07248, 2023.
  38. E. Grøvik, D. Yi, M. Iv, E. Tong, D. Rubin, and G. Zaharchuk, “Deep learning enables automatic detection and segmentation of brain metastases on multisequence mri,” Journal of Magnetic Resonance Imaging, vol. 51, no. 1, pp. 175–182, 2020.
  39. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, pp. 211–252, 2015.
  40. H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, and H. Jégou, “Going deeper with image transformers,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 32–42.
  41. H. Li, P. Xiong, J. An, and L. Wang, “Pyramid attention network for semantic segmentation,” arXiv preprint arXiv:1805.10180, 2018.
  42. L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 801–818.
  43. Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: Redesigning skip connections to exploit multiscale features in image segmentation,” IEEE transactions on medical imaging, vol. 39, no. 6, pp. 1856–1867, 2019.
  44. F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, and K. H. Maier-Hein, “nnu-net: a self-configuring method for deep learning-based biomedical image segmentation,” Nature methods, vol. 18, no. 2, pp. 203–211, 2021.
  45. H. Peiris, M. Hayat, Z. Chen, G. Egan, and M. Harandi, “A robust volumetric transformer for accurate 3d tumor segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2022, pp. 162–172.
Authors (15)
  1. Jieneng Chen (26 papers)
  2. Jieru Mei (26 papers)
  3. Xianhang Li (20 papers)
  4. Yongyi Lu (27 papers)
  5. Qihang Yu (44 papers)
  6. Qingyue Wei (8 papers)
  7. Xiangde Luo (31 papers)
  8. Yutong Xie (68 papers)
  9. Ehsan Adeli (97 papers)
  10. Yan Wang (733 papers)
  11. Matthew Lungren (10 papers)
  12. Lei Xing (83 papers)
  13. Le Lu (148 papers)
  14. Alan Yuille (294 papers)
  15. Yuyin Zhou (92 papers)
Citations (23)