LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models (2404.03118v3)

Published 3 Apr 2024 in cs.CV

Abstract: In the rapidly evolving landscape of artificial intelligence, multi-modal large language models are emerging as a significant area of interest. These models, which combine various forms of data input, are becoming increasingly popular. However, understanding their internal mechanisms remains a complex task. Numerous advancements have been made in the field of explainability tools and mechanisms, yet there is still much to explore. In this work, we present a novel interactive application aimed towards understanding the internal mechanisms of large vision-language models. Our interface is designed to enhance the interpretability of the image patches, which are instrumental in generating an answer, and assess the efficacy of the language model in grounding its output in the image. With our application, a user can systematically investigate the model and uncover system limitations, paving the way for enhancements in system capabilities. Finally, we present a case study of how our application can aid in understanding failure mechanisms in a popular large multi-modal model: LLaVA.
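The interface described above revolves around relating a generated answer back to the image patches the model relied on. As a rough, minimal sketch of the kind of signal such a tool can surface (not the authors' implementation), the snippet below extracts raw attention weights from a LLaVA-style model over its image tokens in a single forward pass and turns them into a patch-level heatmap. The checkpoint id, the processor's expansion of the `<image>` placeholder into per-patch tokens, and the 24x24 patch grid are assumptions for illustration and depend on the library version and vision backbone used.

```python
# Minimal sketch (assumptions noted inline): attention from the last prompt
# token to each image patch in a LLaVA-style model, shaped into a heatmap.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint for illustration
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="eager",  # eager attention so weights are returned
)

image = Image.open("example.jpg")
prompt = "USER: <image>\nWhat color is the car? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# One forward pass with attention weights exposed.
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# Locate image-token positions in the sequence. Recent transformers releases
# expand <image> into one token per patch; this is version-dependent.
image_token_id = model.config.image_token_index
image_positions = (inputs["input_ids"][0] == image_token_id).nonzero(as_tuple=True)[0]

# Last-layer attention from the final text token to each image patch,
# averaged over heads: shape (num_patches,).
last_attn = out.attentions[-1][0]                   # (heads, seq, seq)
patch_relevance = last_attn[:, -1, image_positions].mean(dim=0)

# Assume a 24x24 patch grid (e.g., a 336px CLIP ViT with 14px patches)
# and min-max normalize for display as a heatmap overlay.
grid = patch_relevance.float().reshape(24, 24)
heatmap = (grid - grid.min()) / (grid.max() - grid.min() + 1e-8)
print(heatmap.shape)  # torch.Size([24, 24])
```

Raw attention is only one of several relevance signals an interpretability interface might expose; an interactive application can layer token-level probing, relevancy maps, and failure-case inspection on top of signals like this one.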

Authors (10)
  1. Gabriela Ben Melech Stan (7 papers)
  2. Raanan Yehezkel Rohekar (1 paper)
  3. Yaniv Gurwicz (15 papers)
  4. Matthew Lyle Olson (10 papers)
  5. Anahita Bhiwandiwalla (15 papers)
  6. Estelle Aflalo (11 papers)
  7. Chenfei Wu (32 papers)
  8. Nan Duan (172 papers)
  9. Shao-Yen Tseng (23 papers)
  10. Vasudev Lal (44 papers)
Citations (2)
