VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis (2402.17300v2)

Published 27 Feb 2024 in eess.IV

Abstract: Self-Supervised Learning (SSL) has demonstrated promising results in 3D medical image analysis. However, the lack of high-level semantics in pre-training still heavily hinders the performance of downstream tasks. We observe that 3D medical images contain relatively consistent contextual position information, i.e., consistent geometric relations between different organs, which suggests a way to learn consistent semantic representations in pre-training. In this paper, we propose a simple-yet-effective Volume Contrast (VoCo) framework that leverages these contextual position priors for pre-training. Specifically, we first generate a group of base crops from different regions while enforcing feature discrepancy among them, and employ these crops as class assignments for the different regions. Then, we randomly crop sub-volumes and predict which class each one belongs to (i.e., which region it is located in) by contrasting its similarity to the different base crops, which can be seen as predicting the contextual positions of different sub-volumes. Through this pretext task, VoCo implicitly encodes the contextual position priors into model representations without the guidance of annotations, enabling us to effectively improve the performance of downstream tasks that require high-level semantics. Extensive experimental results on six downstream tasks demonstrate the superior effectiveness of VoCo. Code will be available at https://github.com/Luffy03/VoCo.


Summary

  • The paper introduces VoCo, a contrastive learning framework that improves 3D segmentation accuracy through flexible overlap-based position supervision.
  • It supervises similarity logits with overlap-proportion position labels evaluated via an average log-L1 distance, and applies bi-level regularization for stable, robust training.
  • Evaluations on BTCV, Flare23, and Amos22 show improvements of up to 2-3% in DSC and NSD, easing integration into clinical workflows.

Overview of "VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis"

The paper "VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis" introduces VoCo, a novel framework for enhancing contrastive learning in the context of 3D medical image analysis. The authors present a thorough evaluation of their method and compare it against existing strong baselines using several benchmark datasets, notably BTCV, Flare23, and Amos22. This analysis demonstrates the efficacy of VoCo relative to established approaches.

Framework and Methodology

VoCo addresses the challenge of predicting contextual positions within 3D volumes via contrastive learning. A distinguishing component is the position label generator, which converts the overlap proportions between a randomly cropped sub-volume and a set of base crops into soft labels for supervising the similarity logits. This design eschews the strict one-to-one positive/negative correspondence of traditional contrastive learning, allowing a single prediction to relate to multiple overlapping regions concurrently. Following prior work that favors measuring overall distances even when negative pairs happen to share similar features, the authors evaluate how well the predicted similarities align with the ground-truth position labels using an average log-L1 distance.
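
To make the label generation concrete, here is a minimal PyTorch sketch of how overlap-proportion position labels and the log-L1 prediction loss could be computed. The helper names (`overlap_position_labels`, `position_loss`), the box encoding, and the exact form of the log-L1 term are assumptions for illustration, not the paper's reference implementation:

```python
import torch

def overlap_position_labels(crop_box: torch.Tensor, base_boxes: torch.Tensor) -> torch.Tensor:
    """Soft position labels: the fraction of the random crop's volume that
    overlaps each base crop. crop_box is [z0, y0, x0, z1, y1, x1]; base_boxes
    is (K, 6) in the same format."""
    lo = torch.maximum(crop_box[:3], base_boxes[:, :3])  # overlap lower corners
    hi = torch.minimum(crop_box[3:], base_boxes[:, 3:])  # overlap upper corners
    inter = (hi - lo).clamp(min=0).prod(dim=1)           # overlap volume with each base crop
    crop_vol = (crop_box[3:] - crop_box[:3]).prod()
    return inter.float() / crop_vol                      # proportions in [0, 1], one per base crop

def position_loss(pred_sim: torch.Tensor, labels: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """One plausible reading of the 'average log-L1 distance': penalize the
    per-crop L1 gap between predicted similarities (in [0, 1]) and overlap
    labels on a log scale, averaged over base crops."""
    d = (pred_sim - labels).abs()                        # L1 distance per base crop
    return -torch.log((1 - d).clamp(min=eps)).mean()
```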

Bi-level regularization of the similarity distances between base patches keeps the model stable and robust even when base patches are visually similar. L1 distance is chosen for this regularization because it enforces a milder constraint than alternatives such as L2, a choice the authors support with their quantitative results.
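
As a companion sketch, the discrepancy constraint between base crops might look like the following, where pairwise cosine similarities of base-crop features are pushed toward zero with an L1 penalty. The function name and the reduction are assumptions, and the paper's bi-level formulation may apply the constraint at more than one granularity:

```python
import torch
import torch.nn.functional as F

def base_discrepancy_loss(base_feats: torch.Tensor) -> torch.Tensor:
    """Encourage base-crop features (K, D) to be mutually dissimilar so they
    can serve as distinct region 'class assignments'. Uses a plain L1 penalty
    on the off-diagonal cosine similarities, matching the text's preference
    for the milder L1 constraint."""
    z = F.normalize(base_feats, dim=1)  # unit-norm features, (K, D)
    sim = z @ z.t()                     # pairwise cosine similarities, (K, K)
    k = sim.size(0)
    # Standard trick to select all off-diagonal entries of a (k, k) matrix.
    off_diag = sim.flatten()[:-1].view(k - 1, k + 1)[:, 1:].flatten()
    return off_diag.abs().mean()        # mild L1 push toward zero similarity
```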

Performance Evaluation

The paper presents results from both online tests and offline validations to substantiate VoCo's performance benefits. Comparison with the baseline method SwinUNETR shows noticeable improvements, with VoCo achieving superior scores across multiple tasks on the MSD Decathlon benchmark. Specifically, VoCo reports higher Dice Similarity Coefficient (DSC) and Normalized Surface Dice (NSD) scores across the evaluated datasets, reflecting its capacity to outperform SwinUNETR in 3D volumetric segmentation. Quantitatively, VoCo generally gains up to 2-3% on these segmentation metrics, underscoring its practical utility in real-world medical imaging applications.
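
For reference, the DSC used in these comparisons is the standard overlap metric, DSC = 2|P ∩ G| / (|P| + |G|) for a predicted mask P and ground-truth mask G. A minimal binary-mask computation (not tied to the paper's evaluation code) is:

```python
import torch

def dice_score(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Dice Similarity Coefficient between two binary 3D masks:
    DSC = 2 * |pred & target| / (|pred| + |target|)."""
    pred, target = pred.bool(), target.bool()
    inter = (pred & target).sum().float()               # overlap voxel count
    return (2 * inter + eps) / (pred.sum() + target.sum() + eps)
```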

Theoretical and Practical Implications

The theoretical contribution of VoCo lies in its reformulated approach to contrastive learning. By relaxing the one-to-one correspondence requirement and adopting a volume-based perspective, the framework expands the potential for capturing 3D spatial information. This broadens the application of contrastive learning beyond 2D contexts and provides a basis for future extensions to volumetric data analysis across various domains.

Practically, VoCo's ability to deliver accurate 3D segmentations without relying on complex augmentations or model ensembles simplifies its integration into existing medical imaging workflows. This combination of simplicity and performance could accelerate the adoption of automated systems in clinical settings.

Conclusion and Prospective Developments

The VoCo framework presents a robust alternative to existing 3D medical image pre-training techniques, balancing simplicity and performance. Given the promising results achieved in this paper, future work may further optimize the contrastive learning framework, potentially integrating more sophisticated regularization strategies or extending to other medical imaging datasets. Exploring multi-modal datasets could further cement VoCo's utility in the broader context of healthcare and medical research.
