
Stitching Gaps: Fusing Situated Perceptual Knowledge with Vision Transformers for High-Level Image Classification (2402.19339v1)

Published 29 Feb 2024 in cs.CV and cs.AI

Abstract: The increasing demand for automatic high-level image understanding, particularly in detecting abstract concepts (AC) within images, underscores the necessity for innovative and more interpretable approaches. These approaches need to harmonize traditional deep vision methods with the nuanced, context-dependent knowledge humans employ to interpret images at intricate semantic levels. In this work, we leverage situated perceptual knowledge of cultural images to enhance performance and interpretability in AC image classification. We automatically extract perceptual semantic units from images, which we then model and integrate into the ARTstract Knowledge Graph (AKG). This resource captures situated perceptual semantics gleaned from over 14,000 cultural images labeled with ACs. Additionally, we enhance the AKG with high-level linguistic frames. We compute knowledge graph embeddings (KGE) and experiment with relative representations and hybrid approaches that fuse these embeddings with Vision Transformer (ViT) embeddings. Finally, for interpretability, we conduct post hoc qualitative analyses by examining model similarities with training instances. Our results show that our hybrid KGE-ViT methods outperform existing techniques in AC image classification. The post hoc interpretability analyses reveal the ViT's proficiency in capturing pixel-level visual attributes, contrasting with our method's efficacy in representing more abstract and semantic scene elements. We demonstrate the synergy and complementarity between the situated perceptual knowledge of KG embeddings and the sensory-perceptual understanding of deep visual models for AC image classification. This work suggests a strong potential of neuro-symbolic methods for knowledge integration and robust image representation for use in downstream intricate visual comprehension tasks. All the materials and code are available online.
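
The abstract mentions relative representations and hybrid fusion of KG and ViT embeddings without giving details. The sketch below is one plausible reading and is not taken from the paper's code: each embedding is re-expressed as its cosine similarities to a shared set of anchor samples, and the two resulting vectors are fused by concatenation. The embedding sizes, the random stand-in data, the choice of anchors, the function names, and concatenation as the fusion operator are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against zero rows)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def relative_representation(embeddings, anchors):
    """Cosine similarity of every embedding to every anchor.

    Returns an array of shape (n_samples, n_anchors).
    """
    return l2_normalize(embeddings) @ l2_normalize(anchors).T


def fuse_by_concatenation(kge_rel, vit_rel):
    """Hypothetical fusion step: stack the two relative views side by side."""
    return np.concatenate([kge_rel, vit_rel], axis=1)


# Random stand-ins for real embeddings (assumed sizes, not the paper's).
rng = np.random.default_rng(0)
n_images = 6
kge = rng.normal(size=(n_images, 16))  # pretend KG embeddings, one per image
vit = rng.normal(size=(n_images, 32))  # pretend ViT embeddings, one per image

# The same images act as anchors in both spaces (assumed choice).
anchor_idx = np.array([0, 1, 2])

kge_rel = relative_representation(kge, kge[anchor_idx])
vit_rel = relative_representation(vit, vit[anchor_idx])
hybrid = fuse_by_concatenation(kge_rel, vit_rel)

print(hybrid.shape)  # (6, 6)
```

Because both views are expressed against the same anchors, they end up with the same dimensionality regardless of the original embedding sizes. That makes simple fusion operators such as concatenation straightforward to apply before a downstream classifier.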
