Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos (2312.04746v2)
Abstract: Diagnosis in histopathology requires analyzing whole slide images (WSIs) globally, with pathologists compounding evidence from multiple WSI patches. The gigapixel scale of WSIs poses a challenge for histopathology multi-modal models. Training multi-modal models for histopathology requires instruction-tuning datasets, which currently contain information only for individual image patches, without spatial grounding of the concepts within each patch and without a wider view of the WSI; as a result, they lack sufficient diagnostic capacity for histopathology. To bridge this gap, we introduce Quilt-Instruct, a large-scale dataset of 107,131 histopathology-specific instruction question/answer pairs, grounded within diagnostically relevant image patches that make up the WSI. Our dataset is collected by leveraging educational histopathology videos from YouTube, which provide spatial localization of narrations via automatic extraction of the narrators' cursor positions. Quilt-Instruct supports contextual reasoning by extracting the diagnosis and supporting facts from the entire WSI. Using Quilt-Instruct, we train Quilt-LLaVA, which can reason beyond a given single image patch, enabling diagnostic reasoning across patches. To evaluate Quilt-LLaVA, we propose a comprehensive evaluation dataset created from 985 images and 1,283 human-generated question-answer pairs. We also thoroughly evaluate Quilt-LLaVA on public histopathology datasets, where it outperforms the state of the art by over 10% in relative GPT-4 score and by 4% and 9% on open- and closed-set VQA, respectively. Our code, data, and model are publicly accessible at quilt-llava.github.io.
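To make the grounding idea in the abstract concrete, below is a minimal sketch of how narration text could be spatially grounded with a narrator's cursor trace by aligning timestamps. It is not the authors' released pipeline; the data structures, function names, and averaging heuristic are hypothetical illustrations only.

```python
# Hypothetical sketch: attach cursor positions to the narration spoken
# while the cursor was recorded, producing spatially grounded text.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class NarrationChunk:
    text: str
    start: float  # seconds into the video
    end: float


@dataclass
class CursorSample:
    t: float                 # timestamp in seconds
    xy: Tuple[float, float]  # normalized (x, y) position on the slide view


def ground_narrations(chunks: List[NarrationChunk],
                      cursor: List[CursorSample]) -> List[dict]:
    """For each narration chunk, collect cursor samples that fall inside
    its time span and summarize them as a coarse point annotation."""
    grounded = []
    for chunk in chunks:
        points = [s.xy for s in cursor if chunk.start <= s.t <= chunk.end]
        if not points:
            continue  # no cursor activity during this utterance
        cx = sum(p[0] for p in points) / len(points)
        cy = sum(p[1] for p in points) / len(points)
        grounded.append({"text": chunk.text, "point": (cx, cy), "trace": points})
    return grounded


if __name__ == "__main__":
    chunks = [NarrationChunk("Here we see nuclear pleomorphism.", 12.0, 15.5)]
    cursor = [CursorSample(12.4, (0.31, 0.62)), CursorSample(14.9, (0.35, 0.60))]
    print(ground_narrations(chunks, cursor))
```

In practice, the grounded (text, point/trace) pairs would then feed an instruction-generation step (e.g., prompting a language model) to produce question/answer pairs tied to specific image regions.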
- Mehmet Saygin Seyfioglu
- Wisdom O. Ikezogwo
- Fatemeh Ghezloo
- Ranjay Krishna
- Linda Shapiro