Semantic-Syntactic Discrepancy in Images (SSDI): Learning Meaning and Order of Features from Natural Images (2401.17515v2)
Abstract: Despite considerable progress in image classification, classification models remain largely unaffected by images that deviate significantly from those that appear natural to human eyes. While human perception readily identifies abnormal appearances or compositions in images, classification models overlook alterations in the arrangement of object parts: as long as the parts are present, they are accepted in any order, however unnatural. This work therefore exposes the vulnerability of semantic-syntactic discrepancy in images (SSDI), realized as corruptions that remove or shuffle image patches or present images in the form of puzzles. To address this vulnerability, we propose the concept of "image grammar", comprising "image semantics" and "image syntax". Image semantics pertains to the interpretation of parts or patches within an image, whereas image syntax refers to the arrangement of these parts to form a coherent object. We present a semi-supervised two-stage method that learns the image grammar of visual elements and environments solely from natural images: the first stage learns the semantic meaning of individual object parts, and the second learns how their relative arrangement constitutes an entire object. The efficacy of the proposed approach is demonstrated by SSDI detection rates ranging from 70% to 90% on corruptions generated from the CelebA and SUN-RGBD datasets. Code is publicly available at: https://github.com/ChunTao1999/SSDI/
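As a rough illustration of the patch-shuffle corruption the abstract describes, the sketch below splits an image into non-overlapping patches and reassembles them in a random order. This is a minimal assumption-laden sketch, not the authors' implementation: the function name, the NumPy-array interface, and the requirement that the image dimensions be divisible by the patch size are all choices made here for brevity.

```python
import numpy as np

def shuffle_patches(image, patch_size, rng=None):
    """Corrupt an image by shuffling its non-overlapping patches.

    `image` is an H x W x C array; H and W are assumed (for this
    sketch) to be divisible by `patch_size`.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    # Split the image into a flat list of (patch_size, patch_size, C) patches.
    patches = (image
               .reshape(gh, patch_size, gw, patch_size, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(gh * gw, patch_size, patch_size, c))
    # Permute the patch order; every patch survives intact, only the
    # syntax (arrangement) is destroyed, not the semantics (content).
    patches = patches[rng.permutation(gh * gw)]
    # Reassemble the shuffled patches into an image of the original shape.
    return (patches
            .reshape(gh, gw, patch_size, patch_size, c)
            .transpose(0, 2, 1, 3, 4)
            .reshape(h, w, c))
```

Because the corruption only permutes patches, a classifier that ignores spatial arrangement sees the same multiset of local features, which is exactly the blind spot SSDI detection targets.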