Improving Medical Multi-modal Contrastive Learning with Expert Annotations (2403.10153v3)

Published 15 Mar 2024 in cs.CV and cs.LG

Abstract: We introduce eCLIP, an enhanced version of the CLIP model that integrates expert annotations in the form of radiologist eye-gaze heatmaps. It tackles key challenges in contrastive multi-modal medical imaging analysis, notably data scarcity and the "modality gap" -- a significant disparity between image and text embeddings that diminishes the quality of representations and hampers cross-modal interoperability. eCLIP integrates a heatmap processor and leverages mixup augmentation to efficiently utilize the scarce expert annotations, thus boosting the model's learning effectiveness. eCLIP is designed to be generally applicable to any variant of CLIP without requiring any modifications of the core architecture. Through detailed evaluations across several tasks, including zero-shot inference, linear probing, cross-modal retrieval, and Retrieval Augmented Generation (RAG) of radiology reports using a frozen LLM, eCLIP showcases consistent improvements in embedding quality. The outcomes reveal enhanced alignment and uniformity, affirming eCLIP's capability to harness high-quality annotations for enriched multi-modal analysis in the medical imaging domain.
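The abstract's two core ingredients, a CLIP-style symmetric contrastive loss and mixup applied to scarce expert-annotated samples, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, the choice to mix in embedding space, and the Beta(α, α) mixing coefficient are assumptions for the sake of the example.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit hypersphere, as in CLIP.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def mixup_embeddings(a, b, alpha=0.3, rng=None):
    # Interpolate two batches of unit-norm embeddings (manifold-style
    # mixup), then renormalize so the result stays on the hypersphere.
    # The Beta-distributed mixing weight is a common mixup convention,
    # assumed here for illustration.
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    return l2_normalize(lam * a + (1.0 - lam) * b), lam

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE over the image/text similarity matrix;
    # matched pairs sit on the diagonal.
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature
    n = logits.shape[0]

    def xent(l):
        # Numerically stable log-softmax per row, then negative
        # log-likelihood of the diagonal (matching) entries.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Mixing in embedding space rather than pixel space is one way to stretch a small pool of expert-annotated examples, since each interpolation yields a new soft positive pair without requiring new annotations.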

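The abstract reports improved "alignment and uniformity" of the learned embeddings. These are the two standard contrastive-representation metrics of Wang & Isola (2020): alignment measures how close positive pairs are, and uniformity measures how evenly embeddings spread over the hypersphere (lower is better for both). A small NumPy sketch of the metrics, assuming unit-normalized inputs:

```python
import numpy as np

def alignment(x, y, alpha=2):
    # Mean distance between embeddings of positive pairs (x[i], y[i]);
    # lower means matched image/text embeddings sit closer together.
    return (np.linalg.norm(x - y, axis=1) ** alpha).mean()

def uniformity(x, t=2):
    # Log of the mean pairwise Gaussian potential over distinct pairs;
    # lower means the embeddings spread more uniformly on the sphere.
    sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    iu = np.triu_indices(len(x), k=1)  # exclude self-pairs
    return np.log(np.exp(-t * sq_dists[iu]).mean())
```

Tracking these two quantities is a common way to quantify the "modality gap" the abstract describes: a large gap shows up as poor alignment between image and text embeddings even when each modality is individually well spread out.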