Inquire, Interact, and Integrate: A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning (2405.11640v1)
Abstract: The adoption of large language models (LLMs) in healthcare has attracted significant research interest. However, their performance in healthcare remains under-investigated and potentially limited, because i) they lack rich domain-specific knowledge and medical reasoning skills, and ii) most state-of-the-art LLMs are unimodal, text-only models that cannot directly process multimodal inputs. To address these limitations, we propose \textbf{MultiMedRes}, a multimodal medical collaborative reasoning framework that incorporates a learner agent which proactively acquires essential information from domain-specific expert models to solve multimodal medical reasoning problems. Our method consists of three steps: i) \textbf{Inquire}: the learner agent first decomposes a given complex medical reasoning problem into multiple domain-specific sub-problems; ii) \textbf{Interact}: the agent then interacts with domain-specific expert models, repeating an ``ask-answer'' process to progressively acquire knowledge from different domains; iii) \textbf{Integrate}: the agent finally integrates all the acquired domain-specific knowledge to answer the medical reasoning problem accurately. We validate the effectiveness of our method on the task of difference visual question answering for X-ray images. The experiments demonstrate that our zero-shot prediction achieves state-of-the-art performance, even outperforming fully supervised methods. Moreover, our approach can be incorporated into various LLMs and multimodal LLMs to significantly boost their performance.
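The three-step Inquire-Interact-Integrate loop described in the abstract can be sketched as a minimal agent skeleton. This is an illustrative reconstruction, not the paper's implementation: the function names (`inquire`, `interact`, `integrate`), the toy expert models, and the hard-coded sub-question decomposition are all assumptions standing in for LLM prompts and trained domain experts.

```python
# Hypothetical sketch of the MultiMedRes Inquire-Interact-Integrate loop.
# All names and the toy decomposition below are illustrative assumptions,
# not taken from the paper's actual code.

def inquire(question):
    """Inquire: decompose a complex medical question into
    domain-specific sub-questions (a real system would prompt an LLM;
    here the decomposition is hard-coded for illustration)."""
    return [
        ("anatomy", "Which anatomical regions differ between the two X-rays?"),
        ("disease", "Which findings are present in each differing region?"),
    ]

def interact(sub_questions, experts):
    """Interact: repeat the ask-answer process, routing each
    sub-question to the matching domain-specific expert model."""
    knowledge = []
    for domain, sub_q in sub_questions:
        answer = experts[domain](sub_q)  # ask one expert, collect its answer
        knowledge.append((sub_q, answer))
    return knowledge

def integrate(question, knowledge):
    """Integrate: fuse all acquired domain knowledge into a final
    answer (a real system would feed this context back to the LLM)."""
    facts = "; ".join(answer for _, answer in knowledge)
    return f"Answer to '{question}': based on {facts}"

# Toy stand-ins for the paper's domain-specific expert models.
experts = {
    "anatomy": lambda q: "left lower lobe",
    "disease": lambda q: "new consolidation",
}

question = "What changed between the reference and the follow-up X-ray?"
final = integrate(question, interact(inquire(question), experts))
print(final)
```

In the full framework the ask-answer loop runs for multiple rounds until the learner agent judges it has enough information; this sketch shows a single pass over a fixed decomposition.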
- Zishan Gu
- Fenglin Liu
- Changchang Yin
- Ping Zhang