Freeze the backbones: A Parameter-Efficient Contrastive Approach to Robust Medical Vision-Language Pre-training (2401.01179v1)
Abstract: Modern healthcare often utilises radiographic images alongside textual reports for diagnostics, encouraging the use of Vision-Language Self-Supervised Learning (VL-SSL) with large pre-trained models to learn versatile medical vision representations. However, most existing VL-SSL frameworks are trained end-to-end, which is computation-heavy and can lose vital prior information embedded in pre-trained encoders. To address both issues, we introduce the backbone-agnostic Adaptor framework, which preserves medical knowledge in pre-trained image and text encoders by keeping them frozen, and employs a lightweight Adaptor module for cross-modal learning. Experiments on medical image classification and segmentation tasks across three datasets reveal that our framework delivers competitive performance while cutting trainable parameters by over 90% compared to current pre-training approaches. Notably, when fine-tuned with just 1% of data, Adaptor outperforms several Transformer-based methods trained on full datasets in medical image segmentation.
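Below is a minimal sketch of the frozen-backbone contrastive setup the abstract describes. It assumes a CLIP-style symmetric InfoNCE objective and simple MLP projection heads; the `Adaptor` class, feature dimensions, and the random tensors standing in for frozen-encoder outputs are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: frozen image/text encoders, trainable Adaptor,
# CLIP-style contrastive alignment. Architecture details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adaptor(nn.Module):
    """Lightweight trainable module that maps frozen image and text
    features into a shared space for contrastive learning."""
    def __init__(self, img_dim: int, txt_dim: int, shared_dim: int = 512):
        super().__init__()
        self.img_proj = nn.Sequential(
            nn.Linear(img_dim, shared_dim), nn.GELU(),
            nn.Linear(shared_dim, shared_dim))
        self.txt_proj = nn.Sequential(
            nn.Linear(txt_dim, shared_dim), nn.GELU(),
            nn.Linear(shared_dim, shared_dim))
        # Learnable temperature; exp(2.659) ~ 1/0.07, a common CLIP default.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor):
        z_img = F.normalize(self.img_proj(img_feat), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return z_img, z_txt

def contrastive_loss(z_img, z_txt, logit_scale):
    # Symmetric InfoNCE: matched image-report pairs on the diagonal
    # are positives, all other pairs in the batch are negatives.
    logits = logit_scale.exp() * z_img @ z_txt.t()
    targets = torch.arange(z_img.size(0), device=z_img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# In the full pipeline the pre-trained encoders would be frozen, e.g.
# for p in encoder.parameters(): p.requires_grad_(False)
# Random tensors stand in for their cached outputs here.
adaptor = Adaptor(img_dim=768, txt_dim=768)
img_feat = torch.randn(8, 768)  # frozen image-encoder features (stand-in)
txt_feat = torch.randn(8, 768)  # frozen text-encoder features (stand-in)
z_img, z_txt = adaptor(img_feat, txt_feat)
loss = contrastive_loss(z_img, z_txt, adaptor.logit_scale)
loss.backward()  # gradients reach only the Adaptor's parameters
```

Because the backbones never receive gradients, only the small projection heads are trained, which is consistent with the abstract's claim of cutting trainable parameters by over 90% relative to end-to-end pre-training.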
- Jiuming Qin
- Che Liu
- Sibo Cheng
- Yike Guo
- Rossella Arcucci