MiM: Mask in Mask Self-Supervised Pre-Training for 3D Medical Image Analysis (2404.15580v1)

Published 24 Apr 2024 in cs.CV

Abstract: The Vision Transformer (ViT) has demonstrated remarkable performance in Self-Supervised Learning (SSL) for 3D medical image analysis, and Masked Autoencoder (MAE) pre-training can further unleash the potential of ViT on various medical vision tasks. However, because 3D medical images have large spatial sizes and higher dimensionality, MAE's lack of a hierarchical design may hinder performance on downstream tasks. In this paper, we propose a novel Mask in Mask (MiM) pre-training framework for 3D medical images, which advances MAE by learning discriminative representations from hierarchical visual tokens across varying scales. We generate masked inputs from the volume at multiple levels of granularity and reconstruct them simultaneously at both fine and coarse levels. Additionally, a cross-level alignment mechanism is applied to adjacent-level volumes to enforce anatomical similarity hierarchically. Furthermore, we adopt a hybrid backbone to learn hierarchical representations efficiently during pre-training. MiM was pre-trained on a large collection of available 3D volumetric images, i.e., Computed Tomography (CT) scans covering various body parts. Extensive experiments on thirteen public datasets demonstrate the superiority of MiM over other SSL methods in organ/lesion/tumor segmentation and disease classification. We further scale MiM up to pre-training datasets with more than 10k volumes, showing that large-scale pre-training further improves downstream performance. These improvements suggest that the research community should pay more attention to the scale of the pre-training dataset when building healthcare foundation models for 3D medical images.
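
To make the hierarchical idea in the abstract concrete, below is a minimal PyTorch sketch of a two-level "mask in mask" objective: the volume is tokenized at a coarse and a fine scale, each level is independently masked and reconstructed, and a simple cosine term stands in for the cross-level alignment mechanism. All module names, patch sizes, mask ratios, and the alignment loss here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: module names, patch sizes, mask ratios, and the
# alignment term below are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def random_mask(tokens: torch.Tensor, mask_ratio: float):
    """Keep a random subset of tokens; return visible tokens and a boolean mask."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :n_keep]        # indices of visible tokens
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, keep_idx, False)                  # True marks masked-out tokens
    return visible, mask


class MiMSketch(nn.Module):
    """Two-level masked reconstruction plus a cross-level alignment term."""

    def __init__(self, dim=256, coarse_patch=16, fine_patch=8):
        super().__init__()
        # Hypothetical tokenizers: non-overlapping 3D patch embeddings at two scales.
        self.coarse_embed = nn.Conv3d(1, dim, coarse_patch, stride=coarse_patch)
        self.fine_embed = nn.Conv3d(1, dim, fine_patch, stride=fine_patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)  # shared encoder
        self.coarse_head = nn.Linear(dim, coarse_patch ** 3)  # voxels per coarse patch
        self.fine_head = nn.Linear(dim, fine_patch ** 3)      # voxels per fine patch

    def forward(self, volume: torch.Tensor):
        # volume: (B, 1, D, H, W) CT sub-volume.
        coarse = self.coarse_embed(volume).flatten(2).transpose(1, 2)  # (B, Nc, dim)
        fine = self.fine_embed(volume).flatten(2).transpose(1, 2)      # (B, Nf, dim)
        coarse_vis, _ = random_mask(coarse, mask_ratio=0.60)
        fine_vis, _ = random_mask(fine, mask_ratio=0.75)
        z_coarse = self.encoder(coarse_vis)
        z_fine = self.encoder(fine_vis)
        rec_coarse = self.coarse_head(z_coarse)  # coarse-level reconstruction
        rec_fine = self.fine_head(z_fine)        # fine-level reconstruction
        # Cross-level alignment: pull pooled features of adjacent levels together
        # (a plain cosine stand-in for the paper's alignment mechanism).
        align = 1 - F.cosine_similarity(z_coarse.mean(1), z_fine.mean(1)).mean()
        return rec_coarse, rec_fine, align


model = MiMSketch()
rec_c, rec_f, align_loss = model(torch.randn(2, 1, 64, 64, 64))
```

In the full framework the reconstruction targets would be the masked voxel patches at each level, and the total loss would combine the per-level reconstruction errors with the alignment term; the sketch only shows the moving parts at a glance.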

Authors (6)
  1. Jiaxin Zhuang (14 papers)
  2. Linshan Wu (11 papers)
  3. Qiong Wang (58 papers)
  4. Varut Vardhanabhuti (15 papers)
  5. Lin Luo (27 papers)
  6. Hao Chen (1005 papers)
Citations (1)