UMass-BioNLP at MEDIQA-M3G 2024: DermPrompt -- A Systematic Exploration of Prompt Engineering with GPT-4V for Dermatological Diagnosis (2404.17749v2)
Abstract: This paper presents our team's participation in the MEDIQA-ClinicalNLP2024 shared task B. We present a novel approach to diagnosing clinical dermatology cases with large multimodal models, specifically GPT-4V, used within a retriever and re-ranker framework. Our investigation reveals that GPT-4V, when used as a retrieval agent, retrieves the correct skin condition 85% of the time from dermatological images and brief patient histories. We also show empirically that naive Chain-of-Thought (CoT) prompting works well for retrieval, while Medical Guidelines Grounded CoT is required for accurate dermatological diagnosis. Further, we introduce a Multi-Agent Conversation (MAC) framework and show its superior performance and potential compared with the best CoT strategy. The experiments suggest that, with naive CoT for retrieval and multi-agent conversation for critique-based diagnosis, GPT-4V can support early and accurate diagnosis of dermatological conditions. The implications of this work extend to improving diagnostic workflows, supporting dermatological education, and enhancing patient care by providing a scalable, accessible, and accurate diagnostic tool.
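The retriever and re-ranker flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `diagnose` and `retrieval_hit_rate`, the parameter `k`, and the callable signatures are all hypothetical, and the two stages are passed in as plain callables so that any model wrapper (for example, a GPT-4V prompt) could be substituted. The hit-rate helper shows one plausible way a figure such as "correct condition retrieved 85% of the time" could be computed.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical stage interfaces (assumptions, not taken from the paper):
#   retriever(history, image_paths) -> candidate condition names
#   reranker(history, image_paths, candidates) -> candidates, best first
Retriever = Callable[[str, Sequence[str]], List[str]]
Reranker = Callable[[str, Sequence[str], List[str]], List[str]]


def diagnose(
    history: str,
    image_paths: Sequence[str],
    retriever: Retriever,
    reranker: Reranker,
    k: int = 5,
) -> Tuple[Optional[str], List[str]]:
    """Retrieve up to k candidate conditions, then re-rank them.

    Returns the top-ranked condition (or None if nothing was retrieved)
    together with the retrieved candidate list.
    """
    candidates = retriever(history, image_paths)[:k]
    ranked = reranker(history, image_paths, candidates)
    top = ranked[0] if ranked else None
    return top, candidates


def retrieval_hit_rate(
    candidate_lists: Sequence[Sequence[str]],
    gold_labels: Sequence[str],
) -> float:
    """Fraction of cases whose gold label appears among the candidates.

    Matching is case-insensitive exact string match, which is a
    simplifying assumption; real evaluation may need synonym handling.
    """
    if not gold_labels:
        return 0.0
    hits = 0
    for candidates, gold in zip(candidate_lists, gold_labels):
        if gold.lower() in {c.lower() for c in candidates}:
            hits += 1
    return hits / len(gold_labels)
```

Keeping the stages as injected callables means the retrieval prompt (naive CoT in the abstract) and the diagnosis step (guideline-grounded CoT or multi-agent conversation) can be swapped independently without changing the surrounding pipeline.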