Towards Visual Syntactical Understanding (2401.17497v1)
Abstract: Syntax is usually studied in the realm of linguistics and refers to the arrangement of words in a sentence. Similarly, an image can be considered a visual 'sentence', with the semantic parts of the image acting as 'words'. While visual syntactic understanding comes naturally to humans, it is interesting to explore whether deep neural networks (DNNs) are equipped with such reasoning. To that end, we alter the syntax of natural images (e.g. swapping the eye and nose of a face), producing what we refer to as 'incorrect' images, to investigate the sensitivity of DNNs to such syntactic anomalies. Through our experiments, we discover an intriguing property of DNNs: state-of-the-art convolutional neural networks, as well as vision transformers, fail to discriminate between syntactically correct and incorrect images when trained on only correct ones. To counter this issue and enable visual syntactic understanding with DNNs, we propose a three-stage framework: (i) the 'words' (or sub-features) in the image are detected, (ii) the detected words are sequentially masked and reconstructed using an autoencoder, and (iii) the original and reconstructed parts are compared at each location to determine syntactic correctness. The reconstruction module is trained with BERT-like masked autoencoding for images, motivated by leveraging LLM-inspired training to better capture the syntax. Note that our proposed approach is unsupervised in the sense that the incorrect images are used only during testing, and the correct-versus-incorrect labels are never used for training. We perform experiments on the CelebA and AFHQ datasets and obtain classification accuracies of 92.10% and 90.89%, respectively. Notably, the approach generalizes well to ImageNet samples that share common classes with CelebA and AFHQ, without explicitly training on them.
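The three-stage framework described in the abstract can be sketched as below. This is a minimal illustration, not the authors' implementation: `detect_parts` stands in for a learned part detector, and the `autoencoder` callable stands in for the BERT-style masked autoencoder; both are hypothetical placeholders.

```python
import numpy as np

def detect_parts(image):
    # Stage (i): locate the semantic 'words' (sub-features) in the image.
    # Placeholder: two fixed boxes standing in for a learned detector.
    h, w = image.shape[:2]
    return [(0, 0, h // 2, w // 2), (h // 2, w // 2, h, w)]

def reconstruct_masked(image, box, autoencoder):
    # Stage (ii): mask one detected part, then reconstruct it with the
    # masked autoencoder (here, any callable acting on the masked image).
    y0, x0, y1, x1 = box
    masked = image.copy()
    masked[y0:y1, x0:x1] = 0.0
    return autoencoder(masked)[y0:y1, x0:x1]

def syntactic_score(image, autoencoder):
    # Stage (iii): compare each original part with its reconstruction;
    # a large mean discrepancy flags the image as syntactically 'incorrect'.
    errors = []
    for box in detect_parts(image):
        y0, x0, y1, x1 = box
        recon = reconstruct_masked(image, box, autoencoder)
        errors.append(np.abs(image[y0:y1, x0:x1] - recon).mean())
    return float(np.mean(errors))

# Usage: with an oracle 'autoencoder' that reproduces the original image,
# every part is reconstructed perfectly and the score is zero.
img = np.random.rand(8, 8)
score = syntactic_score(img, autoencoder=lambda x: img)
```

A syntactically scrambled input would yield reconstructions that disagree with the observed parts, raising the score; thresholding it then separates correct from incorrect images without ever training on incorrect ones.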