AgileFormer: Spatially Agile Transformer UNet for Medical Image Segmentation (2404.00122v2)
Abstract: Over the past decades, deep neural networks, particularly convolutional neural networks, have achieved state-of-the-art performance in a variety of medical image segmentation tasks. Recently, the introduction of the vision transformer (ViT) has significantly altered the landscape of deep segmentation models, and ViTs have drawn growing attention because of their excellent performance and scalability. However, we argue that the current design of vision transformer-based UNet (ViT-UNet) segmentation models may not effectively handle the heterogeneous appearance (e.g., varying shapes and sizes) of objects of interest in medical image segmentation tasks. To tackle this challenge, we present a structured approach to introduce spatially dynamic components into the ViT-UNet, enabling the model to effectively capture features of target objects with diverse appearances. This is achieved through three main components: (i) deformable patch embedding; (ii) spatially dynamic multi-head attention; and (iii) deformable positional encoding. These components are integrated into a novel architecture, termed AgileFormer, a spatially agile ViT-UNet designed for medical image segmentation. Experiments on three segmentation tasks using publicly available datasets demonstrate the effectiveness of the proposed method. The code is available at https://github.com/sotiraslab/AgileFormer.