
Learning Hierarchical Image Segmentation For Recognition and By Recognition (2210.00314v4)

Published 1 Oct 2022 in cs.CV, cs.AI, and cs.LG

Abstract: Large vision and language models learned directly through image-text associations often lack detailed visual substantiation, whereas image segmentation tasks are treated separately from recognition and learned with supervision, without interconnections. Our key observation is that, while an image can be recognized in multiple ways, each way has a consistent part-and-whole visual organization. Segmentation thus should be treated not as an end task to be mastered through supervised learning, but as an internal process that evolves with and supports the ultimate goal of recognition. We propose to integrate a hierarchical segmenter into the recognition process, and to train and adapt the entire model solely on image-level recognition objectives. We learn hierarchical segmentation for free alongside recognition, automatically uncovering part-to-whole relationships that not only underpin but also enhance recognition. Enhancing the Vision Transformer (ViT) with adaptive segment tokens and graph pooling, our model surpasses ViT in unsupervised part-whole discovery, semantic segmentation, image classification, and efficiency. Notably, our model (trained on 1M unlabeled ImageNet images) outperforms SAM (trained on 11M images and 1 billion masks) by an absolute 8% in mIoU on PartImageNet object segmentation.
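The headline comparison is reported in mIoU (mean intersection-over-union), and "absolute 8%" means a gap of 8 percentage points between two mIoU scores rather than a relative improvement. As a minimal sketch of the metric itself, the Python below computes mIoU over flattened per-pixel label lists. The function name `mean_iou`, the toy labels, and the choice to average only over classes that appear in the prediction or the ground truth are illustrative assumptions; the abstract does not describe the paper's exact evaluation protocol (for example, per-image versus dataset-level averaging, or ignored labels).

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over classes that occur in either pred or target.

    Hypothetical helper for illustration only; not the paper's evaluation code.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")

    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both maps (intersection) and in either map (union).
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps (an assumption)
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
# Per-class IoU: 1/3, 2/3, 1/2, so the mean is 0.5.
print(f"mIoU = {score:.3f}")
```

Under this definition, a model scoring 0.58 against one scoring 0.50 on the same data would be described as an absolute 8% gain in mIoU; these two numbers are made up for illustration and are not results from the paper.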
