Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping (2404.00974v1)
Abstract: Visual scenes are naturally organized in a hierarchy, where a coarse semantic concept is recursively composed of several fine details. Exploring such a visual hierarchy is crucial for recognizing the complex relations among visual elements, leading to comprehensive scene understanding. In this paper, we propose a Visual Hierarchy Mapper (Hi-Mapper), a novel approach for enhancing the structured understanding of pre-trained Deep Neural Networks (DNNs). Hi-Mapper investigates the hierarchical organization of the visual scene by 1) pre-defining a hierarchy tree through the encapsulation of probability densities; and 2) learning the hierarchical relations in hyperbolic space with a novel hierarchical contrastive loss. The pre-defined hierarchy tree recursively interacts with the visual features of the pre-trained DNNs through hierarchy decomposition and encoding procedures, thereby identifying the visual hierarchy and enhancing the recognition of the entire scene. Extensive experiments demonstrate that Hi-Mapper significantly enhances the representation capability of DNNs, leading to improved performance on various tasks, including image classification and dense prediction.
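To give a sense of why hyperbolic space suits hierarchical relations, the following is a minimal sketch of the geodesic distance in the unit Poincaré ball. It is an illustration only: the abstract does not state which hyperbolic model or curvature Hi-Mapper uses, so the Poincaré ball with curvature -1, the function name `poincare_distance`, and the example points `root`, `child_a`, and `child_b` are all assumptions made here, not the authors' implementation.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit
    Poincaré ball (curvature -1 is assumed for this sketch)."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


# Hypothetical embedding: a coarse node at the origin, two fine nodes near the boundary.
root = (0.0, 0.0)
child_a = (0.9, 0.0)
child_b = (0.0, 0.9)

d_root_a = poincare_distance(root, child_a)
d_a_b = poincare_distance(child_a, child_b)
```

In this toy configuration the distance between the two fine nodes is close to the sum of their distances to the coarse node, so the shortest path between siblings behaves almost as if it passed through the parent. This tree-like behaviour is the usual motivation for embedding hierarchies in hyperbolic rather than Euclidean space.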