Visual Explanations of Image-Text Representations via Multi-Modal Information Bottleneck Attribution

Published 28 Dec 2023 in cs.CV, cs.AI, and cs.LG (arXiv:2312.17174v2)

Abstract: Vision-language pretrained models have seen remarkable success, but their application to safety-critical settings is limited by their lack of interpretability. To improve the interpretability of vision-language models such as CLIP, we propose a multi-modal information bottleneck (M2IB) approach that learns latent representations that compress irrelevant information while preserving relevant visual and textual features. We demonstrate how M2IB can be applied to attribution analysis of vision-language pretrained models, increasing attribution accuracy and improving the interpretability of such models when applied to safety-critical domains such as healthcare. Crucially, unlike commonly used unimodal attribution methods, M2IB does not require ground-truth labels, making it possible to audit representations of vision-language pretrained models when multiple modalities but no ground-truth data are available. Using CLIP as an example, we demonstrate the effectiveness of M2IB attribution and show that it outperforms gradient-based, perturbation-based, and attention-based attribution methods both qualitatively and quantitatively.
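The core idea, a trade-off between keeping features that are relevant to the paired text and compressing everything else away, can be illustrated with a toy sketch. Note this is not the paper's actual M2IB method (which injects noise into intermediate representations and bounds mutual-information terms variationally); it is a simplified, deterministic stand-in that learns a per-feature mask over a hypothetical image embedding. All embeddings, the relevance score, and the `beta` weight below are made-up illustrations.

```python
import math

def _norm(v):
    return math.sqrt(sum(a * a for a in v))

def ib_attribution(x, t, beta=0.1, lr=0.5, steps=200, eps=1e-4):
    """Toy information-bottleneck-style attribution (NOT the paper's M2IB).

    Learns a per-feature mask lam in [0, 1]^d over the image embedding x:
    the masked embedding lam*x should stay aligned with the text embedding
    t (relevance term), while the mask is pushed toward zero (compression
    term, weighted by beta). A high lam[i] marks feature i as relevant.
    """
    d = len(x)
    scale = _norm(x) * _norm(t) or 1.0  # fixed normalizer for the relevance score

    def objective(lam):
        z = [l * xi for l, xi in zip(lam, x)]
        relevance = sum(zi * ti for zi, ti in zip(z, t)) / scale
        compression = sum(lam) / d
        return relevance - beta * compression

    lam = [0.5] * d
    for _ in range(steps):
        # projected gradient ascent with numerical (central-difference) gradients
        grad = []
        for i in range(d):
            hi, lo = list(lam), list(lam)
            hi[i] += eps
            lo[i] -= eps
            grad.append((objective(hi) - objective(lo)) / (2 * eps))
        lam = [min(1.0, max(0.0, l + lr * g)) for l, g in zip(lam, grad)]
    return lam

# Hypothetical 4-dim embeddings: only feature 0 aligns with the text.
x = [1.0, 1.0, 0.3, 0.0]  # image embedding (made up for illustration)
t = [1.0, 0.0, 0.0, 0.0]  # text embedding (made up for illustration)
lam = ib_attribution(x, t)
```

Because no ground-truth label enters the objective, only the two embeddings, this mirrors the abstract's point that the attribution can be computed from paired modalities alone; the mask converges toward 1 on the text-aligned feature and toward 0 elsewhere.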
