Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
131 tokens/sec
GPT-4o
10 tokens/sec
Gemini 2.5 Pro Pro
47 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping (2404.00974v1)

Published 1 Apr 2024 in cs.CV

Abstract: Visual scenes are naturally organized in a hierarchy, where a coarse semantic is recursively comprised of several fine details. Exploring such a visual hierarchy is crucial to recognize the complex relations of visual elements, leading to a comprehensive scene understanding. In this paper, we propose a Visual Hierarchy Mapper (Hi-Mapper), a novel approach for enhancing the structured understanding of the pre-trained Deep Neural Networks (DNNs). Hi-Mapper investigates the hierarchical organization of the visual scene by 1) pre-defining a hierarchy tree through the encapsulation of probability densities; and 2) learning the hierarchical relations in hyperbolic space with a novel hierarchical contrastive loss. The pre-defined hierarchy tree recursively interacts with the visual features of the pre-trained DNNs through hierarchy decomposition and encoding procedures, thereby effectively identifying the visual hierarchy and enhancing the recognition of an entire scene. Extensive experiments demonstrate that Hi-Mapper significantly enhances the representation capability of DNNs, leading to an improved performance on various tasks, including image classification and dense prediction tasks.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (70)
  1. Hyperbolic vision transformers: Combining improvements in metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7409–7419, 2022.
  2. Hier: Metric learning beyond class labels via hierarchical regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19903–19912, 2023.
  3. Detecting and recognizing human-object interactions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8359–8367, 2018.
  4. Knowing where to focus: Event-aware transformer for video grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13846–13856, 2023.
  5. Affordance transfer learning for human-object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 495–504, 2021.
  6. Probabilistic prompt learning for dense prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6768–6777, 2023.
  7. Pin the memory: Learning to generalize semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4350–4360, 2022.
  8. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  9. Stand-alone self-attention in vision models. Advances in neural information processing systems, 32, 2019.
  10. Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10076–10085, 2020.
  11. Cmt: Convolutional neural networks meet vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12175–12185, 2022.
  12. Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 22–31, 2021.
  13. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12124–12134, 2022.
  14. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF international conference on computer vision, pages 568–578, 2021.
  15. Mvitv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4804–4814, 2022a.
  16. Crossvit: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF international conference on computer vision, pages 357–366, 2021.
  17. Beyond fixation: Dynamic window visual transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11987–11997, 2022.
  18. Quadtree attention for vision transformers. arXiv preprint arXiv:2201.02767, 2022.
  19. Visual dependency transformers: Dependency tree emerges from reversed attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14528–14539, 2023.
  20. Learning hierarchical image segmentation for recognition and by recognition. In The Twelfth International Conference on Learning Representations, 2023.
  21. The geometry of graphs and some of its algorithmic applications. In Proceedings 35th Annual Symposium on Foundations of Computer Science, pages 577–591, 1994. doi: 10.1109/SFCS.1994.365733.
  22. Curvature regularization to prevent distortion in graph embedding. Advances in Neural Information Processing Systems, 33:20779–20790, 2020.
  23. Poincaré embeddings for learning hierarchical representations. Advances in neural information processing systems, 30, 2017a.
  24. Learning continuous hierarchies in the lorentz model of hyperbolic geometry. In International conference on machine learning, pages 3779–3788. PMLR, 2018.
  25. Curvature generation in curved spaces for few-shot learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8691–8700, 2021.
  26. Poincar\\\backslash\’e glove: Hyperbolic word embeddings. arXiv preprint arXiv:1810.06546, 2018.
  27. Hypertext: Endowing fasttext with hyperbolic geometry. arXiv preprint arXiv:2010.16143, 2020.
  28. Hyperbolic graph convolutional neural networks. Advances in neural information processing systems, 32, 2019.
  29. Hyperbolic image-text representations. In International Conference on Machine Learning, pages 7694–7731. PMLR, 2023.
  30. Word representations via gaussian embedding. In International Conference on Learning Representations, 2015.
  31. Multimodal word distributions. arXiv preprint arXiv:1704.08424, 2017.
  32. Hierarchical density order embeddings. In International Conference on Learning Representations, 2018.
  33. Probabilistic modeling of semantic ambiguity for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12527–12536, 2021.
  34. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  35. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021.
  36. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  37. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  38. Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017.
  39. Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence, 32(9):1627–1645, 2009.
  40. Bottom-up/top-down image parsing with attribute grammar. IEEE transactions on pattern analysis and machine intelligence, 31(1):59–73, 2008.
  41. Learning hierarchical models of scenes, objects, and parts. In Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, volume 2, pages 1331–1338. IEEE, 2005.
  42. Image parsing: Unifying segmentation, detection, and recognition. International Journal of computer vision, 63:113–140, 2005.
  43. A numerical study of the bottom-up and top-down inference processes in and-or graphs. International journal of computer vision, 93:226–252, 2011.
  44. Learning compositional neural information fusion for human parsing. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5703–5713, 2019.
  45. Hierarchical human parsing with typed part-relation reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8929–8939, 2020.
  46. Unsupervised part discovery by unsupervised disentanglement. In Pattern Recognition: 42nd DAGM German Conference, DAGM GCPR 2020, Tübingen, Germany, September 28–October 1, 2020, Proceedings 42, pages 345–359. Springer, 2021.
  47. Unsupervised part discovery from contrastive reconstruction. Advances in Neural Information Processing Systems, 34:28104–28118, 2021.
  48. Scops: Self-supervised co-part segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 869–878, 2019.
  49. Learning hierarchical image segmentation for recognition and by recognition. In The Twelfth International Conference on Learning Representations, 2024.
  50. Probabilistic embeddings for cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8415–8424, 2021.
  51. Probabilistic face embeddings. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6902–6911, 2019.
  52. Probabilistic representations for video contrastive learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14711–14721, 2022.
  53. Poincaré embeddings for learning hierarchical representations. Advances in neural information processing systems, 30, 2017b.
  54. Hyperbolic image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4453–4462, 2022.
  55. Unsupervised discovery of the long-tail in instance segmentation using hierarchical self-supervision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2603–2612, 2021.
  56. Hyperbolic image embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6418–6428, 2020.
  57. Variational dropout and the local reparameterization trick. Advances in neural information processing systems, 28, 2015.
  58. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  59. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019.
  60. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
  61. Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 8(3):415–424, 2022.
  62. Davit: Dual attention vision transformers. In European Conference on Computer Vision, pages 74–92. Springer, 2022.
  63. Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2998–3008, 2021.
  64. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
  65. Augment your batch: Improving generalization through instance repetition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8129–8138, 2020.
  66. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  67. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  68. Exploring plain vision transformer backbones for object detection. In European Conference on Computer Vision, pages 280–296. Springer, 2022b.
  69. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6399–6408, 2019.
  70. Unified perceptual parsing for scene understanding. In Proceedings of the European conference on computer vision (ECCV), pages 418–434, 2018.
Citations (1)

Summary

We haven't generated a summary for this paper yet.