Stitching Gaps: Fusing Situated Perceptual Knowledge with Vision Transformers for High-Level Image Classification (2402.19339v1)
Abstract: The increasing demand for automatic high-level image understanding, particularly in detecting abstract concepts (AC) within images, underscores the necessity for innovative and more interpretable approaches. These approaches need to harmonize traditional deep vision methods with the nuanced, context-dependent knowledge humans employ to interpret images at intricate semantic levels. In this work, we leverage situated perceptual knowledge of cultural images to enhance performance and interpretability in AC image classification. We automatically extract perceptual semantic units from images, which we then model and integrate into the ARTstract Knowledge Graph (AKG). This resource captures situated perceptual semantics gleaned from over 14,000 cultural images labeled with ACs. Additionally, we enhance the AKG with high-level linguistic frames. We compute KG embeddings (KGE) and experiment with relative representations and hybrid approaches that fuse these embeddings with visual transformer (ViT) embeddings. Finally, for interpretability, we conduct post hoc qualitative analyses by examining model similarities with training instances. Our results show that our hybrid KGE-ViT methods outperform existing techniques in AC image classification. The post hoc interpretability analyses reveal the visual transformer's proficiency in capturing pixel-level visual attributes, contrasting with our method's efficacy in representing more abstract and semantic scene elements. We demonstrate the synergy and complementarity between the situated perceptual knowledge of KG embeddings and the sensory-perceptual understanding of deep visual models for AC image classification. This work suggests strong potential for neuro-symbolic methods in knowledge integration and robust image representation for downstream intricate visual comprehension tasks. All the materials and code are available online.
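The abstract mentions relative representations and hybrid approaches that fuse KG embeddings with visual transformer embeddings. The following is a minimal illustrative sketch of that general idea, not the authors' implementation. It assumes the common formulation of a relative representation as the vector of cosine similarities between a sample and a fixed set of anchor samples, and it assumes fusion by simple concatenation. The embedding sizes, the random data, the anchor selection, and the function names (`relative_representation`, `fuse_concat`) are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against division by zero)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def relative_representation(embeddings, anchors):
    """Cosine similarity of every embedding to every anchor.

    Returns an array of shape (n_samples, n_anchors).
    """
    return l2_normalize(embeddings) @ l2_normalize(anchors).T


def fuse_concat(rel_a, rel_b):
    """Hypothetical fusion step: concatenate two relative representations."""
    return np.concatenate([rel_a, rel_b], axis=-1)


# Stand-in data: random vectors in place of real KGE and ViT embeddings.
rng = np.random.default_rng(0)
n_images, n_anchors = 6, 4
kge = rng.normal(size=(n_images, 100))  # assumed KGE dimension
vit = rng.normal(size=(n_images, 768))  # assumed ViT dimension

# Use the same images as anchors in both spaces, so that column j of each
# relative representation refers to the same anchor image.
anchor_idx = np.arange(n_anchors)
rel_kge = relative_representation(kge, kge[anchor_idx])
rel_vit = relative_representation(vit, vit[anchor_idx])

hybrid = fuse_concat(rel_kge, rel_vit)
print(hybrid.shape)  # (6, 8)
```

Because both relative representations are indexed by the same anchors, embeddings from spaces of different dimensionality become directly comparable. This is what makes a simple fusion step such as concatenation possible without training a projection layer.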