
Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models (2404.12920v2)

Published 19 Apr 2024 in cs.CV and cs.LG

Abstract: Localizing the exact pathological regions in a given medical scan is an important imaging problem that requires a large number of ground-truth bounding box annotations to be solved accurately. However, there exist alternative, potentially weaker, forms of supervision, such as accompanying free-text reports, which are readily available. The task of performing localization with textual guidance is commonly referred to as phrase grounding. In this work, we use a publicly available Foundation Model, namely the Latent Diffusion Model, to solve this challenging task. This choice is supported by the fact that the Latent Diffusion Model, despite being generative in nature, contains mechanisms (cross-attention) that implicitly align visual and textual features, thus leading to intermediate representations that are suitable for the task at hand. In addition, we aim to perform this task in a zero-shot manner, i.e., without any further training on target data, meaning that the model's weights remain frozen. To this end, we devise strategies to select features and also refine them via post-processing without extra learnable parameters. We compare our proposed method with state-of-the-art (SOTA) approaches, which explicitly enforce image-text alignment in a joint embedding space via contrastive learning. Results on a popular chest X-ray benchmark indicate that our method is competitive with SOTA approaches on different types of pathology, and even outperforms them on average in terms of two metrics (mean IoU and AUC-ROC). Source code will be released upon acceptance.
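
As a rough illustration of the IoU metric mentioned in the abstract, the sketch below scores a localization heatmap against a ground-truth bounding box. It is not the authors' implementation: the function names, the min-max normalization, and the fixed threshold of 0.5 are all assumptions made for this example, and the abstract does not specify how heatmaps are binarized.

```python
import numpy as np

def box_to_mask(box, shape):
    """Rasterize an (x0, y0, x1, y1) box into a boolean mask.
    Upper bounds are exclusive (an assumption of this sketch)."""
    x0, y0, x1, y1 = box
    mask = np.zeros(shape, dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask

def iou(pred_mask, gt_mask):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def heatmap_iou(heatmap, box, threshold=0.5):
    """Min-max normalize a heatmap, binarize it at a fixed threshold
    (hypothetical choice), and compare it with the box mask."""
    lo, hi = heatmap.min(), heatmap.max()
    if hi > lo:
        norm = (heatmap - lo) / (hi - lo)
    else:
        # A constant heatmap carries no localization signal.
        norm = np.zeros(heatmap.shape, dtype=float)
    return iou(norm >= threshold, box_to_mask(box, heatmap.shape))

# Toy example: a 4x4 heatmap with a 2x2 activated region.
heatmap = np.zeros((4, 4))
heatmap[1:3, 1:3] = 1.0
score_exact = heatmap_iou(heatmap, (1, 1, 3, 3))    # box matches the region
score_shifted = heatmap_iou(heatmap, (0, 0, 2, 2))  # box overlaps one pixel
```

Averaging such per-sample scores over a dataset would give a mean IoU; a threshold-free metric such as AUC-ROC would instead use the continuous heatmap values directly.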

Authors
  1. Konstantinos Vilouras
  2. Pedro Sanchez
  3. Alison Q. O'Neil
  4. Sotirios A. Tsaftaris
