Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework (2403.07636v4)
Abstract: Medical vision-language pre-training (VLP) has emerged as a research frontier, enabling zero-shot pathology recognition by comparing a query image with the textual descriptions of each disease. Because of the complex semantics of biomedical text, current methods struggle to align medical images with the key pathological findings in unstructured reports, which leads to misalignment with the target disease's textual representation. In this paper, we introduce a novel VLP framework that decomposes disease descriptions into their fundamental aspects, leveraging prior knowledge about the visual manifestations of pathologies obtained by consulting an LLM and medical experts. Using a Transformer module, our approach aligns an input image with the diverse elements of a disease, producing aspect-centric image representations. By consolidating the matches from each aspect, we improve the compatibility between an image and its associated disease. In addition, building on the aspect-oriented representations, we present a dual-head Transformer tailored to known and unknown diseases, optimizing overall detection performance. In experiments on seven downstream datasets, our method improves the accuracy of recent methods by up to 8.56% and 17.26% for seen and unseen categories, respectively. Our code is released at https://github.com/HieuPhan33/MAVL.
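The aspect-matching idea in the abstract can be illustrated with a short sketch: each disease is described by several visual aspects, an image is scored against every aspect's text embedding, and the per-aspect scores are consolidated into a single disease score. This is a minimal illustration, not the released MAVL implementation; the aspect names in `DISEASE_ASPECTS`, the mean-pooling used to consolidate scores, and the function names are assumptions made for the example.

```python
# Minimal sketch (not the authors' code) of aspect-centric zero-shot matching.
import torch
import torch.nn.functional as F

# Hypothetical aspect decomposition, e.g. produced by prompting an LLM
# about the visual manifestations of each pathology.
DISEASE_ASPECTS = {
    "pneumothorax": ["opacity", "texture", "shape", "location"],
    "atelectasis":  ["opacity", "texture", "shape", "location"],
}

def zero_shot_scores(image_feats: torch.Tensor,
                     aspect_text_feats: dict) -> dict:
    """image_feats: (B, D) image embeddings.
    aspect_text_feats: disease -> (num_aspects, D) text embeddings, one per aspect.
    Returns disease -> (B,) compatibility scores."""
    image_feats = F.normalize(image_feats, dim=-1)
    scores = {}
    for disease, text_feats in aspect_text_feats.items():
        text_feats = F.normalize(text_feats, dim=-1)
        per_aspect = image_feats @ text_feats.t()   # (B, num_aspects) cosine similarities
        scores[disease] = per_aspect.mean(dim=-1)   # consolidate aspect matches (mean is an assumption)
    return scores

# Usage with random features standing in for encoder outputs.
img = torch.randn(4, 512)
texts = {d: torch.randn(len(a), 512) for d, a in DISEASE_ASPECTS.items()}
print({d: s.shape for d, s in zero_shot_scores(img, texts).items()})
```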
Authors: Yutong Xie, Yuankai Qi, Lingqiao Liu, Liyang Liu, Bowen Zhang, Zhibin Liao, Qi Wu, Minh-Son To, Johan W. Verjans, Vu Minh Hieu Phan