Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models (2404.12920v2)
Abstract: Localizing the exact pathological regions in a given medical scan is an important imaging problem, and solving it accurately requires a large number of ground-truth bounding box annotations. However, there exist alternative, potentially weaker, forms of supervision, such as accompanying free-text reports, which are readily available. The task of performing localization with textual guidance is commonly referred to as phrase grounding. In this work, we use a publicly available Foundation Model, namely the Latent Diffusion Model, to solve this challenging task. This choice is supported by the fact that the Latent Diffusion Model, despite being generative in nature, contains mechanisms (cross-attention) that implicitly align visual and textual features, thus leading to intermediate representations that are suitable for the task at hand. In addition, we aim to perform this task in a zero-shot manner, i.e., without any further training on target data, meaning that the model's weights remain frozen. To this end, we devise strategies to select features and to refine them via post-processing without extra learnable parameters. We compare our proposed method with state-of-the-art approaches that explicitly enforce image-text alignment in a joint embedding space via contrastive learning. Results on a popular chest X-ray benchmark indicate that our method is competitive with SOTA on different types of pathology, and even outperforms these approaches on average in terms of two metrics (mean IoU and AUC-ROC). Source code will be released upon acceptance.
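To make the evaluation setting concrete, the following is a minimal sketch of how a text-conditioned heatmap could be scored against a bounding box using IoU, one of the two metrics named in the abstract. It is not the authors' implementation: the function names, the `(x, y, width, height)` box convention, and the fixed 0.5 threshold are all assumptions made for illustration, and the abstract does not specify the post-processing actually used.

```python
import numpy as np

def heatmap_to_mask(heatmap, threshold=0.5):
    """Min-max normalize a heatmap to [0, 1] and binarize it.

    The fixed threshold is an illustrative assumption, not the paper's method.
    """
    h = np.asarray(heatmap, dtype=np.float64)
    span = h.max() - h.min()
    if span == 0:
        # A constant heatmap carries no localization signal.
        return np.zeros(h.shape, dtype=bool)
    return (h - h.min()) / span >= threshold

def box_to_mask(shape, box):
    """Rasterize a box given as (x, y, width, height) in pixels (assumed convention)."""
    x, y, w, hgt = box
    mask = np.zeros(shape, dtype=bool)
    mask[y:y + hgt, x:x + w] = True
    return mask

def iou(pred, target):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return float(inter / union) if union > 0 else 0.0

# Toy 8x8 heatmap with a single 2x2 activated region.
heatmap = np.zeros((8, 8))
heatmap[2:4, 2:4] = 1.0
pred = heatmap_to_mask(heatmap)

exact = iou(pred, box_to_mask(heatmap.shape, (2, 2, 2, 2)))
partial = iou(pred, box_to_mask(heatmap.shape, (2, 2, 4, 2)))
```

In this toy case the prediction matches the first box exactly, while the second box is twice as wide as the predicted region, so only half of the union overlaps. Averaging such per-sample scores over a dataset yields a mean IoU.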
- Konstantinos Vilouras
- Pedro Sanchez
- Alison Q. O'Neil
- Sotirios A. Tsaftaris