G2D: From Global to Dense Radiography Representation Learning via Vision-Language Pre-training (2312.01522v4)

Published 3 Dec 2023 in cs.CV and cs.LG

Abstract: Recently, medical vision-language pre-training (VLP) has made substantial progress in learning global visual representations from medical images and their paired radiology reports. However, real-world medical imaging tasks usually require finer-grained visual features. These tasks include visual localization (e.g., semantic segmentation, object detection) and visual grounding. Current medical VLP methods face challenges in learning such fine-grained features because they rely primarily on brute-force alignment between image patches and individual text tokens for local visual feature learning, which is suboptimal for downstream dense prediction tasks. In this work, we propose a new VLP framework, named \textbf{G}lobal to \textbf{D}ense level representation learning (G2D), that achieves significantly improved granularity and more accurate grounding of the learned features compared to existing medical VLP approaches. In particular, G2D learns dense and semantically grounded image representations via a pseudo segmentation task running in parallel with the global vision-language alignment. Notably, generating pseudo segmentation targets incurs no extra trainable parameters: they are obtained on the fly during VLP with a parameter-free processor. G2D achieves superior performance across 6 medical imaging tasks and 25 diseases, particularly in semantic segmentation, which requires fine-grained, semantically grounded image features. In this task, G2D surpasses peer models even when fine-tuned with just 1\% of the training data, compared to the 100\% used by those models. The code can be found at https://github.com/cheliu-computation/G2D-NeurIPS24/tree/main.
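The abstract states that pseudo segmentation targets are produced on the fly by a parameter-free processor, i.e., without any trainable weights. The paper's exact processor is not described here; as a minimal sketch of the idea, one parameter-free way to derive a binary pseudo mask is to threshold a normalized saliency or attention map at a fixed quantile. The function name `pseudo_seg_target` and the quantile rule below are illustrative assumptions, not the authors' method.

```python
import numpy as np

def pseudo_seg_target(attention_map: np.ndarray, quantile: float = 0.8) -> np.ndarray:
    """Derive a binary pseudo segmentation target from a saliency/attention map.

    Parameter-free in the sense of the abstract's claim: no trainable weights
    are involved. The thresholding rule here is an illustrative assumption;
    the paper's actual processor may differ.
    """
    # Normalize the map to [0, 1] so the quantile threshold is scale-invariant.
    amin, amax = attention_map.min(), attention_map.max()
    norm = (attention_map - amin) / (amax - amin + 1e-8)
    # Keep the most salient regions as foreground (1), the rest as background (0).
    thresh = np.quantile(norm, quantile)
    return (norm >= thresh).astype(np.uint8)

# Toy example: a 4x4 "attention map" with increasing saliency.
toy = np.arange(16, dtype=np.float32).reshape(4, 4)
mask = pseudo_seg_target(toy, quantile=0.75)
```

Because the processor has no learnable parameters, such targets can be regenerated each training step at negligible cost, which matches the abstract's claim that dense supervision is obtained for free alongside the global alignment objective.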

Authors (6)
  1. Che Liu
  2. Cheng Ouyang
  3. Sibo Cheng
  4. Anand Shah
  5. Wenjia Bai
  6. Rossella Arcucci