PM2: A New Prompting Multi-modal Model Paradigm for Few-shot Medical Image Classification (2404.08915v2)
Abstract: Few-shot learning has been successfully applied to medical image classification, as only very few annotated medical examples are available for training. Because annotated medical images are scarce, image representations should not be derived solely from a single image modality, which is insufficient for characterizing concept classes. In this paper, we propose a new prompting multi-modal model paradigm for medical image classification built on multi-modal foundation models, called PM2. Besides the image modality, PM2 introduces a supplementary text input, known as a prompt, to further describe the corresponding images or concept classes and to facilitate few-shot learning across diverse modalities. To better explore the potential of prompt engineering, we empirically investigate five distinct prompt schemes under the new paradigm. Furthermore, linear probing in multi-modal models acts as a linear classification head that takes only the class token as input, completely ignoring the rich statistics inherent in high-level visual tokens. We therefore perform linear classification on the feature distribution of visual tokens and on the class token simultaneously. To effectively mine such rich statistics, global covariance pooling with efficient matrix power normalization is used to aggregate the visual tokens. We then study and combine two classification heads: one is shared between the class token of the image from the vision encoder and the prompt representation encoded by the text encoder; the other performs classification on the feature distribution of visual tokens from the vision encoder. Extensive experiments on three medical datasets show that PM2 significantly outperforms its counterparts regardless of prompt scheme and achieves state-of-the-art performance.
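The pooling-and-heads recipe in the abstract can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the paper's implementation: it computes a covariance matrix over visual tokens, applies matrix square-root normalization via the Newton-Schulz iteration (one common, efficient realization of matrix power normalization), and combines a shared linear head on the class token and prompt embedding with a second head on the covariance representation. The variable names, dimensions, and the summation-based fusion of logits are illustrative assumptions.

```python
import numpy as np

def covariance_pool(tokens, iters=5):
    """Second-order pooling of visual tokens with matrix square-root
    normalization (Newton-Schulz iteration).
    tokens: (N, D) array of visual token features for one image."""
    n, d = tokens.shape
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (n - 1)           # (D, D) sample covariance
    tr = np.trace(cov)
    a = cov / tr                                    # pre-scale so eigenvalues <= 1
    y, z = a, np.eye(d)
    for _ in range(iters):                          # Y converges to A^{1/2}
        t = 0.5 * (3.0 * np.eye(d) - z @ y)
        y, z = y @ t, t @ z
    return (y * np.sqrt(tr)).reshape(-1)            # flattened normalized covariance

# Toy dimensions and random features standing in for encoder outputs (assumption).
rng = np.random.default_rng(0)
d, n_cls, n_tok = 8, 3, 16
W_shared = rng.standard_normal((d, n_cls))          # head shared by class token & prompt
W_cov = rng.standard_normal((d * d, n_cls))         # head on covariance representation

cls_token = rng.standard_normal(d)                  # image class token (vision encoder)
prompt_emb = rng.standard_normal(d)                 # prompt representation (text encoder)
visual_tokens = rng.standard_normal((n_tok, d))     # high-level visual tokens

# Fusing the three logit streams by summation is an assumption for illustration.
logits = (cls_token @ W_shared
          + prompt_emb @ W_shared
          + covariance_pool(visual_tokens) @ W_cov)
```

The trace pre-scaling keeps the Newton-Schulz iteration in its convergence region, so the square root can be computed with a few matrix multiplications instead of an eigendecomposition, which is the usual motivation for this style of "efficient" matrix power normalization.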