DatUS^2: Data-driven Unsupervised Semantic Segmentation with Pre-trained Self-supervised Vision Transformer (2401.12820v1)

Published 23 Jan 2024 in cs.CV and cs.LG

Abstract: New self-supervised training schemes continue to emerge, each taking a step closer to a universal foundation model. In this process, unsupervised downstream tasks are recognized as one way to validate the quality of visual features learned with a self-supervised training scheme. However, unsupervised dense semantic segmentation has not been explored as a downstream task, even though it can exploit and evaluate the semantic information embedded in the patch-level feature representations learned during self-supervised training of a vision transformer. This paper therefore proposes a novel data-driven approach, DatUS^2, which performs unsupervised semantic segmentation as a downstream task. DatUS^2 generates semantically consistent, dense, pseudo-annotated segmentation masks for an unlabeled image dataset without using any visual prior or synchronized data. We compare these pseudo-annotated masks with ground-truth masks to evaluate how well recent self-supervised training schemes learn shared semantic properties at the patch level and discriminative semantic properties at the segment level. Finally, we evaluate existing state-of-the-art self-supervised training schemes with the proposed downstream task. The best version of DatUS^2 outperforms the existing state-of-the-art method for unsupervised dense semantic segmentation with 15.02% MIoU and 21.47% pixel accuracy on the SUIM dataset, and it achieves competitive accuracy on the large-scale, complex COCO dataset.
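The abstract only outlines the pipeline, so the following is a minimal, hedged sketch of its first stage: producing a dense pseudo-mask from the patch-level features of a pre-trained self-supervised ViT. The DINO ViT-S/8 backbone (loaded from torch.hub) and the plain k-means clustering step are illustrative assumptions, not the authors' exact procedure, which the paper develops into a full segment-level pseudo-labeling scheme.

```python
# Sketch: dense pseudo-mask from self-supervised ViT patch features.
# Assumptions: DINO ViT-S/8 backbone and k-means over patch embeddings;
# DatUS^2's actual pipeline is more involved than this illustration.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

model = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')
model.eval()

@torch.no_grad()
def pseudo_mask(img: torch.Tensor, n_clusters: int = 8) -> torch.Tensor:
    """img: (1, 3, H, W), ImageNet-normalized, with H and W divisible by 8."""
    _, _, H, W = img.shape
    tokens = model.get_intermediate_layers(img, n=1)[0]  # (1, 1 + N, 384)
    patches = tokens[0, 1:, :]                           # drop the [CLS] token
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        patches.cpu().numpy())                           # one label per patch
    grid = torch.from_numpy(labels).reshape(1, 1, H // 8, W // 8).float()
    # Nearest-neighbor upsampling expands the patch grid to a dense pixel mask.
    return F.interpolate(grid, size=(H, W), mode='nearest')[0, 0].long()
```

For a 224x224 input this clusters a 28x28 grid of 384-dimensional patch embeddings and upsamples the result back to pixel resolution.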

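To score such cluster-valued pseudo-masks against ground truth with MIoU and pixel accuracy, cluster IDs must first be mapped to classes. A common recipe in unsupervised segmentation, and plausibly what an evaluation like this uses, is Hungarian matching over the confusion matrix; the sketch below assumes equal numbers of clusters and classes.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def evaluate(pred: np.ndarray, gt: np.ndarray, n_classes: int):
    """pred, gt: integer label maps of equal shape, values in [0, n_classes)."""
    # Confusion matrix: rows are predicted cluster IDs, columns are GT classes.
    conf = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(conf, (pred.ravel(), gt.ravel()), 1)
    # Hungarian matching assigns each cluster to its best-overlapping class.
    rows, cols = linear_sum_assignment(conf, maximize=True)
    remap = np.empty(n_classes, dtype=np.int64)
    remap[rows] = cols
    pred = remap[pred]
    pixel_acc = float((pred == gt).mean())
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)), pixel_acc  # (MIoU, pixel accuracy)
```

Hungarian matching resolves the arbitrary ordering of unsupervised cluster labels before the segmentation metrics are computed, which is the standard practice for this class of benchmarks.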
Authors (3)
  1. Sonal Kumar (30 papers)
  2. Arijit Sur (10 papers)
  3. Rashmi Dutta Baruah (2 papers)