ECOR: Explainable CLIP for Object Recognition (2404.12839v1)

Published 19 Apr 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Large Vision Language Models (VLMs), such as CLIP, have significantly contributed to various computer vision tasks, including object recognition and object detection, and their open-vocabulary capability further enhances their value. However, their black-box nature and lack of explainability make their predictions less trustworthy in critical domains. Recently, some work has been done to force VLMs to provide reasonable rationales for object recognition, but this often comes at the expense of classification accuracy. In this paper, we first propose a mathematical definition of explainability in the object recognition task based on the joint probability distribution of categories and rationales, then leverage this definition to fine-tune CLIP in an explainable manner. Through evaluations on different datasets, our method demonstrates state-of-the-art performance in explainable classification. Notably, it excels in zero-shot settings, showcasing its adaptability. This advancement improves explainable object recognition, enhancing trust across diverse applications. The code will be made available online upon publication.
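To make the abstract's central idea concrete, below is a minimal sketch of how (category, rationale) pairs can be scored jointly with an off-the-shelf CLIP model and then marginalized over rationales to recover a category prediction. This is only an illustrative approximation of the scoring structure, not the paper's fine-tuning method; the prompt template, category and rationale lists, and image path are hypothetical assumptions.

```python
# Minimal sketch (not the authors' exact method): approximate a joint
# distribution over (category, rationale) pairs with pretrained CLIP and
# marginalize over rationales to obtain a category prediction.
# Categories, rationales, prompt template, and image path are illustrative.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

categories = ["zebra", "horse"]                      # hypothetical label set
rationales = {                                       # hypothetical rationales
    "zebra": ["black and white stripes", "a short stiff mane"],
    "horse": ["a solid-colored coat", "a long flowing mane"],
}

# One prompt per (category, rationale) pair.
pairs = [(c, r) for c in categories for r in rationales[c]]
prompts = [f"a photo of a {c}, because it has {r}" for c, r in pairs]

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    # Softmax over all (category, rationale) prompts acts as an estimate
    # of the joint distribution p(category, rationale | image).
    logits = model.logit_scale.exp() * image_feat @ text_feat.T
    joint = logits.softmax(dim=-1).squeeze(0)

# Marginalize over rationales to score each category.
category_scores = {c: 0.0 for c in categories}
for (c, _), p in zip(pairs, joint.tolist()):
    category_scores[c] += p
print(category_scores)
```

In the paper's framing, it is this joint distribution over categories and rationales that CLIP is fine-tuned to model well; the sketch above only illustrates the scoring and marginalization with pretrained weights.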

Authors (4)
  1. Ali Rasekh (4 papers)
  2. Sepehr Kazemi Ranjbar (2 papers)
  3. Milad Heidari (1 paper)
  4. Wolfgang Nejdl (46 papers)
Citations (2)