Freeze the backbones: A Parameter-Efficient Contrastive Approach to Robust Medical Vision-Language Pre-training (2401.01179v1)

Published 2 Jan 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Modern healthcare often utilises radiographic images alongside textual reports for diagnostics, encouraging the use of Vision-Language Self-Supervised Learning (VL-SSL) with large pre-trained models to learn versatile medical vision representations. However, most existing VL-SSL frameworks are trained end-to-end, which is computation-heavy and can lose vital prior information embedded in pre-trained encoders. To address both issues, we introduce the backbone-agnostic Adaptor framework, which preserves medical knowledge in pre-trained image and text encoders by keeping them frozen, and employs a lightweight Adaptor module for cross-modal learning. Experiments on medical image classification and segmentation tasks across three datasets reveal that our framework delivers competitive performance while cutting trainable parameters by over 90% compared to current pre-training approaches. Notably, when fine-tuned with just 1% of data, Adaptor outperforms several Transformer-based methods trained on full datasets in medical image segmentation.
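
The recipe in the abstract is simple enough to sketch. Below is a minimal, hypothetical PyTorch illustration of the "freeze the backbones" idea: both pre-trained encoders are frozen, and only a small cross-modal head is trained with a CLIP-style symmetric contrastive (InfoNCE) loss. The encoder stand-ins, feature dimensions, and plain linear projections are illustrative assumptions, not the paper's actual Adaptor module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveAdaptor(nn.Module):
    """Trainable cross-modal head on top of two frozen backbones (sketch)."""

    def __init__(self, image_encoder, text_encoder,
                 img_dim=768, txt_dim=768, embed_dim=256, temperature=0.07):
        super().__init__()
        self.image_encoder = image_encoder  # e.g. a pre-trained ViT (frozen)
        self.text_encoder = text_encoder    # e.g. a biomedical BERT (frozen)
        # Freeze the backbones: gradients never reach the encoders.
        for enc in (self.image_encoder, self.text_encoder):
            for p in enc.parameters():
                p.requires_grad = False
        # Only these small heads (and the temperature) are trained.
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1.0 / temperature)))

    def forward(self, images, token_ids):
        with torch.no_grad():  # frozen backbones also skip graph building
            img_feat = self.image_encoder(images)    # (B, img_dim)
            txt_feat = self.text_encoder(token_ids)  # (B, txt_dim)
        z_img = F.normalize(self.img_proj(img_feat), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        # Symmetric InfoNCE: each image should match its paired report.
        logits = self.logit_scale.exp() * (z_img @ z_txt.t())  # (B, B)
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

# Toy stand-ins so the sketch runs end to end; in practice these would be
# large pre-trained encoders (which is the whole point of freezing them).
frozen_img_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 768))
frozen_txt_enc = nn.EmbeddingBag(30522, 768)  # mean-pools token embeddings

model = ContrastiveAdaptor(frozen_img_enc, frozen_txt_enc)
# Only parameters with requires_grad=True are optimized; never updating the
# backbones is where the >90% cut in trainable parameters comes from.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)

images = torch.randn(4, 3, 32, 32)
reports = torch.randint(0, 30522, (4, 16))
loss = model(images, reports)
loss.backward()
optimizer.step()
```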

Authors (5)
  1. Jiuming Qin (1 paper)
  2. Che Liu (59 papers)
  3. Sibo Cheng (36 papers)
  4. Yike Guo (144 papers)
  5. Rossella Arcucci (50 papers)