3VL: Using Trees to Teach Vision & Language Models Compositional Concepts (2312.17345v1)
Abstract: Vision-Language models (VLMs) have proven effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks. However, these representations suffer from key shortcomings in Compositional Language Concepts (CLC) understanding, such as recognizing objects' attributes, states, and relations between different objects. Moreover, VLMs typically have poor interpretability, making it challenging to debug and mitigate compositional-understanding failures. In this work, we introduce the Tree-augmented Vision-Language (3VL) model architecture and training technique, accompanied by our proposed Anchor inference method and Differential Relevance (DiRe) interpretability tool. By expanding the text of an arbitrary image-text pair into a hierarchical tree structure using language analysis tools, 3VL induces this structure into the visual representation learned by the model, enhancing its interpretability and compositional reasoning. Additionally, we show how Anchor, a simple technique for text unification, can be employed to filter nuisance factors while improving CLC understanding performance, e.g., on the fundamental VL-Checklist benchmark. We also show how DiRe, which performs a differential comparison between VLM relevancy maps, enables us to generate compelling visualizations of the reasons for a model's success or failure.
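As a rough illustration of the caption-to-tree expansion described in the abstract, the minimal Python sketch below uses spaCy's dependency parse to split a caption into objects with their attributes and governing relations. The function name `caption_to_tree` and the flat object/attribute/relation layout are illustrative assumptions for this sketch, not the authors' 3VL pipeline.

```python
# Minimal sketch (not the authors' code): expand a caption into a shallow
# hierarchy of objects -> attributes / relations using spaCy dependency parsing.
# Assumes the small English model is installed: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def caption_to_tree(caption: str) -> dict:
    """Very rough caption -> {object: {"attributes": [...], "relation": ...}} expansion."""
    doc = nlp(caption)
    tree = {}
    for chunk in doc.noun_chunks:
        head = chunk.root                                                   # main noun, e.g. "dog"
        attributes = [t.text for t in head.children if t.dep_ == "amod"]    # e.g. "brown"
        # Approximate the relation by the word governing this noun (verb or preposition).
        relation = head.head.text if head.head is not head else None
        tree[head.text] = {"attributes": attributes, "relation": relation}
    return tree

if __name__ == "__main__":
    print(caption_to_tree("a brown dog chasing a small white ball on green grass"))
```

In the paper, such a tree is used to structure the text side of each image-text pair so that the visual representation can be trained against it level by level; the sketch only shows the parsing step.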
- S. Doveh, A. Arbelle, S. Harary, R. Panda, R. Herzig, E. Schwartz, D. Kim, R. Giryes, R. Feris, S. Ullman, and L. Karlinsky, “Teaching structured vision & language concepts to vision & language models,” in CVPR, 2023.
- X. Hao, Y. Zhu, S. Appalaraju, A. Zhang, W. Zhang, B. Li, and M. Li, “Mixgen: A new multi-modal data augmentation,” in WACV, 2023.
- P. Cascante-Bonilla, K. Shehada, J. S. Smith, S. Doveh, D. Kim, R. Panda, G. Varol, A. Oliva, V. Ordonez, R. Feris, and L. Karlinsky, “Going beyond nouns with vision & language models using synthetic data,” in ICCV, 2023.
- A. Ray, K. Sikka, A. Divakaran, S. Lee, and G. Burachas, “Sunny and dark outside?! improving answer consistency in vqa through entailed question generation,” in EMNLP, 2019.
- R. Tang, C. Ma, W. E. Zhang, Q. Wu, and X. Yang, “Semantic equivalent adversarial data augmentation for visual question answering,” in European Conference on Computer Vision, 2020, pp. 437–453.
- K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional networks: Visualising image classification models and saliency maps,” arXiv preprint arXiv:1312.6034, 2013.
- R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 618–626.
- M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should I trust you?’: Explaining the predictions of any classifier,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
- M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” in International conference on machine learning, 2017, pp. 3319–3328.
- S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” Advances in neural information processing systems, vol. 30, 2017.
- S. Kolek, D. A. Nguyen, R. Levie, J. Bruna, and G. Kutyniok, “Cartoon explanations of image classifiers,” in ECCV, 2022, pp. 443–458.
- S. Kolek, R. Windesheim, H. Andrade Loarca, G. Kutyniok, and R. Levie, “Explaining image classifiers with multiscale directional image representation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim, “Sanity checks for saliency maps,” in Advances in neural information processing systems, vol. 31, 2018.
- M. Yang and B. Kim, “Benchmarking attribution methods with relative feature importance,” 2019.
- P.-J. Kindermans, S. Hooker, J. Adebayo, M. Alber, K. T. Schütt, S. Dähne, D. Erhan, and B. Kim, “The (un)reliability of saliency methods,” 2017.
- H. Shah, P. Jain, and P. Netrapalli, “Do input gradients highlight discriminative features?” in Advances in Neural Information Processing Systems, vol. 34, 2021.
- D. Slack, S. Hilgard, E. Jia, S. Singh, and H. Lakkaraju, “Fooling lime and shap: Adversarial attacks on post hoc explanation methods,” in AAAI/ACM Conference on AI, Ethics, and Society, 2020, pp. 180–186.
- C. Rudin, “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,” Nature Machine Intelligence, vol. 1, no. 5, pp. 206–215, 2019.
- P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, and P. Liang, “Concept bottleneck models,” in International Conference on Machine Learning, 2020, pp. 5338–5348.
- A. Subramanya, V. Pillai, and H. Pirsiavash, “Fooling network interpretation in image classification,” in IEEE/CVF International Conference on Computer Vision, 2019, pp. 2020–2029.
- G. Schwalbe and B. Finzel, “A comprehensive taxonomy for explainable artificial intelligence: a systematic survey of surveys on methods and concepts,” Data Mining and Knowledge Discovery, 2023.
- H. Chefer, S. Gur, and L. Wolf, “Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers,” in IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 397–406.
- Y. Zhang, P. Tino, A. Leonardis, and K. Tang, “A survey on neural network interpretability,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 5, no. 5, pp. 726–742, 2021.
- H. Chefer, S. Gur, and L. Wolf, “Transformer interpretability beyond attention visualization,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 782–791.
- S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation,” PloS one, vol. 10, no. 7, p. e0130140, 2015.
- M. Bohle, M. Fritz, and B. Schiele, “Convolutional dynamic alignment networks for interpretable classifications,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10029–10038.
- D. Alvarez Melis and T. Jaakkola, “Towards robust interpretability with self-explaining neural networks,” Advances in neural information processing systems, vol. 31, 2018.
- A. Chattopadhyay, S. Slocum, B. D. Haeffele, R. Vidal, and D. Geman, “Interpretable by design: Learning predictors by composing interpretable queries,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 6, pp. 7430–7443, 2023.
- A. Chattopadhyay, K. H. R. Chan, B. D. Haeffele, D. Geman, and R. Vidal, “Variational information pursuit for interpretable predictions,” in International Conference on Learning Representations, 2023.
- C.-K. Yeh, B. Kim, S. Arik, C.-L. Li, T. Pfister, and P. Ravikumar, “On completeness-aware concept-based explanations in deep neural networks,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 20554–20565.
- J. Donnelly, A. J. Barnett, and C. Chen, “Deformable protopnet: An interpretable image classifier using deformable prototypes,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10265–10275.
- M. Nauta, R. van Bree, and C. Seifert, “Neural prototype trees for interpretable fine-grained image recognition,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14933–14943.
- A. Sarkar, D. Vijaykeerthy, A. Sarkar, and V. N. Balasubramanian, “A framework for learning ante-hoc explainable models via concepts,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10286–10295.
- D. Lindner, J. Kramár, M. Rahtz, T. McGrath, and V. Mikulik, “Tracr: Compiled transformers as a laboratory for interpretability,” arXiv preprint arXiv:2301.05062, 2023.
- C. Mao, R. Teotia, A. Sundar, S. Menon, J. Yang, X. Wang, and C. Vondrick, “Doubly right object recognition: A why prompt for visual rationales,” in CVPR, 2023.
- L. Karlinsky, J. Shtok, A. Alfassy, M. Lichtenstein, S. Harary, E. Schwartz, S. Doveh, P. Sattigeri, R. Feris, A. Bronstein et al., “Starnet: towards weakly supervised few-shot object detection,” in AAAI Conference on Artificial Intelligence, vol. 35, no. 2, 2021, pp. 1743–1753.
- H. Li, J. Song, M. Xue, H. Zhang, J. Ye, L. Cheng, and M. Song, “A survey of neural trees,” arXiv preprint arXiv:2209.03415, 2022.
- K. S. Tai, R. Socher, and C. D. Manning, “Improved semantic representations from tree-structured long short-term memory networks,” in ACL, 2015.
- D. Liu, H. Zhang, F. Wu, and Z.-J. Zha, “Learning to assemble neural module tree networks for visual grounding,” in IEEE/CVF International Conference on Computer Vision, 2019, pp. 4673–4682.
- P. Kontschieder, M. Fiterau, A. Criminisi, and S. R. Bulo, “Deep neural decision forests,” in IEEE International Conference on Computer Vision (ICCV), December 2015.
- R. Tanno, K. Arulkumaran, D. Alexander, A. Criminisi, and A. Nori, “Adaptive neural trees,” in International Conference on Machine Learning, 2019, pp. 6166–6175.
- B. Wan, W. Han, Z. Zheng, and T. Tuytelaars, “Unsupervised vision-language grammar induction with shared structure modeling,” in International Conference on Learning Representations, 2022.
- M. Wu, S. Parbhoo, M. C. Hughes, V. Roth, and F. Doshi-Velez, “Optimizing for interpretability in deep neural networks with tree regularization,” Journal of Artificial Intelligence Research, vol. 72, pp. 1–37, 2021.
- Y. Ding, L. Wang, H. Zhang, J. Yi, D. Fan, and B. Gong, “Defending against adversarial attacks using random forest,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019, pp. 105–114.
- G. Cohen and R. Giryes, “Simple post-training robustness using test time augmentations and random forest,” 2021.
- A. Wan, L. Dunlap, D. Ho, J. Yin, S. Lee, S. Petryk, S. A. Bargal, and J. E. Gonzalez, “NBDT: Neural-backed decision tree,” in International Conference on Learning Representations, 2021.
- A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning, 2021, pp. 8748–8763.
- C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International Conference on Machine Learning, 2021, pp. 4904–4916.
- H. Tan and M. Bansal, “Lxmert: Learning cross-modality encoder representations from transformers,” in EMNLP, 2019.
- Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu, “Uniter: Universal image-text representation learning,” in European conference on computer vision, 2020, pp. 104–120.
- X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei et al., “Oscar: Object-semantics aligned pre-training for vision-language tasks,” in European Conference on Computer Vision, 2020, pp. 121–137.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, no. 3, pp. 211–252, 2015.
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision, 2014, pp. 740–755.
- L. Yao, R. Huang, L. Hou, G. Lu, M. Niu, H. Xu, X. Liang, Z. Li, X. Jiang, and C. Xu, “FILIP: Fine-grained interactive language-image pre-training,” in International Conference on Learning Representations, 2022.
- S. Goel, H. Bansal, S. Bhatia, R. A. Rossi, V. Vinay, and A. Grover, “Cyclip: Cyclic contrastive language-image pretraining,” arXiv preprint arXiv:2205.14459, 2022.
- Y. Li, F. Liang, L. Zhao, Y. Cui, W. Ouyang, J. Shao, F. Yu, and J. Yan, “Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm,” arXiv preprint arXiv:2110.05208, 2021.
- Y. Gao, J. Liu, Z. Xu, J. Zhang, K. Li, and C. Shen, “Pyramidclip: Hierarchical feature alignment for vision-language model pretraining,” arXiv preprint arXiv:2204.14095, 2022.
- W. Kim, B. Son, and I. Kim, “Vilt: Vision-and-language transformer without convolution or region supervision,” in International Conference on Machine Learning, 2021, pp. 5583–5594.
- J. Yang, J. Duan, S. Tran, Y. Xu, S. Chanda, L. Chen, B. Zeng, T. Chilimbi, and J. Huang, “Vision-language pre-training with triple contrastive learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15671–15680.
- M. Shukor, G. Couairon, and M. Cord, “Efficient vision-language pretraining with visual concepts and hierarchical alignment,” in 33rd British Machine Vision Conference (BMVC), 2022.
- J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in ICML, 2022.
- T. Zhao, T. Zhang, M. Zhu, H. Shen, K. Lee, X. Lu, and J. Yin, “Vl-checklist: Evaluating pre-trained vision-language models with objects, attributes and relations,” arXiv preprint arXiv:2207.00221, 2022.
- T. Thrush, R. Jiang, M. Bartolo, A. Singh, A. Williams, D. Kiela, and C. Ross, “Winoground: Probing vision and language models for visio-linguistic compositionality,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5238–5248.
- F. Liu, G. E. T. Emerson, and N. Collier, “Visual spatial reasoning,” Transactions of the Association for Computational Linguistics, 2023.
- A. Ray, F. Radenovic, A. Dubey, B. A. Plummer, R. Krishna, and K. Saenko, “Cola: How to adapt vision-language models to compose objects localized with attributes?” 2023.
- N. Dziri, X. Lu, M. Sclar, X. L. Li, L. Jiang, B. Y. Lin, P. West, C. Bhagavatula, R. L. Bras, J. D. Hwang, S. Sanyal, S. Welleck, X. Ren, A. Ettinger, Z. Harchaoui, and Y. Choi, “Faith and fate: Limits of transformers on compositionality,” in NeurIPS, 2023.
- L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang, “Visualbert: A simple and performant baseline for vision and language,” ArXiv, vol. abs/1908.03557, 2019.
- D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei, “Scene Graph Generation by Iterative Message Passing,” in CVPR, 2017, pp. 3097–3106.
- R. Herzig, M. Raboh, G. Chechik, J. Berant, and A. Globerson, “Mapping images to scene graphs with permutation-invariant structured prediction,” in Advances in Neural Information Processing Systems, 2018.
- R. Krishna, I. Chami, M. S. Bernstein, and L. Fei-Fei, “Referring relationships,” ECCV, 2018.
- A. Jerbi, R. Herzig, J. Berant, G. Chechik, and A. Globerson, “Learning object detection from captions via textual scene attributes,” ArXiv, vol. abs/2009.14558, 2020.
- M. Raboh, R. Herzig, G. Chechik, J. Berant, and A. Globerson, “Differentiable scene graphs,” in WACV, 2020.
- F. Baradel, N. Neverova, C. Wolf, J. Mille, and G. Mori, “Object level visual reasoning in videos,” in ECCV, 2018, pp. 105–121.
- P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner et al., “Relational inductive biases, deep learning, and graph networks,” arXiv preprint arXiv:1806.01261, 2018.
- C. Gao, J. Xu, Y. Zou, and J.-B. Huang, “Drg: Dual relation graph for human-object interaction detection,” ArXiv, vol. abs/2008.11714, 2020.
- K. Kato, Y. Li, and A. Gupta, “Compositional learning for human object interaction,” in ECCV, 2018.
- B. Xu, Y. Wong, J. Li, Q. Zhao, and M. Kankanhalli, “Learning to detect human-object interactions with knowledge,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2019–2028, 2019.
- E. B. Avraham, R. Herzig, K. Mangalam, A. Bar, A. Rohrbach, L. Karlinsky, T. Darrell, and A. Globerson, “Bringing image scene structure to video via frame-clip consistency of object tokens,” in Thirty-Sixth Conference on Neural Information Processing Systems, 2022.
- A. Arnab, C. Sun, and C. Schmid, “Unified graph structured models for video understanding,” in ICCV, 2021.
- J. Materzynska, T. Xiao, R. Herzig, H. Xu, X. Wang, and T. Darrell, “Something-else: Compositional action recognition with spatial-temporal interaction networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2020.
- R. Herzig, E. Ben-Avraham, K. Mangalam, A. Bar, G. Chechik, A. Rohrbach, T. Darrell, and A. Globerson, “Object-region video transformers,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- R. Herzig, E. Levi, H. Xu, H. Gao, E. Brosh, X. Wang, A. Globerson, and T. Darrell, “Spatio-temporal action graph networks,” in IEEE International Conference on Computer Vision Workshops, 2019.
- J. Ji, R. Krishna, L. Fei-Fei, and J. C. Niebles, “Action genome: Actions as composition of spatio-temporal scene graphs,” arXiv preprint arXiv:1912.06992, 2019.
- X. Wang and A. Gupta, “Videos as space-time region graphs,” in ECCV, 2018.
- A. Bar, R. Herzig, X. Wang, A. Rohrbach, G. Chechik, T. Darrell, and A. Globerson, “Compositional video synthesis with action graphs,” in ICML, 2021.
- R. Herzig, A. Bar, H. Xu, G. Chechik, T. Darrell, and A. Globerson, “Learning canonical representations for scene graph to image generation,” in European Conference on Computer Vision, 2020.
- J. Johnson, A. Gupta, and L. Fei-Fei, “Image generation from scene graphs,” in IEEE conference on computer vision and pattern recognition, 2018, pp. 1219–1228.
- M. Honnibal and I. Montani, “spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing,” 2017, to appear.
- H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, A. Castro-Ros, M. Pellat, K. Robinson, D. Valter, S. Narang, G. Mishra, A. Yu, V. Zhao, Y. Huang, A. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, and J. Wei, “Scaling instruction-finetuned language models,” 2022.
- G. A. Miller, “Wordnet: A lexical database for english,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
- C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.
- M. Yuksekgonul, F. Bianchi, P. Kalluri, D. Jurafsky, and J. Zou, “When and why vision-language models behave like bags-of-words, and what to do about it?” in International Conference on Learning Representations, 2023.
- C. Li, H. Liu, L. H. Li, P. Zhang, J. Aneja, J. Yang, P. Jin, Y. J. Lee, H. Hu, Z. Liu, and J. Gao, “Elevater: A benchmark and toolkit for evaluating language-augmented visual models,” Neural Information Processing Systems, 2022.
- T. Lüddecke and A. Ecker, “Image segmentation using text and image prompts,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 7086–7096.
- C. Wu, Z. Lin, S. Cohen, T. Bui, and S. Maji, “Phrasecut: Language-based image segmentation in the wild,” 2020.
Authors: Nir Yellinek, Leonid Karlinsky, Raja Giryes