Neural Clustering based Visual Representation Learning (2403.17409v1)

Published 26 Mar 2024 in cs.CV

Abstract: We investigate a fundamental aspect of machine vision: the measurement of features, by revisiting clustering, one of the most classic approaches in machine learning and data analysis. Existing visual feature extractors, including ConvNets, ViTs, and MLPs, represent an image as rectangular regions. Though prevalent, such a grid-style paradigm is built upon engineering practice and lacks explicit modeling of data distribution. In this work, we propose feature extraction with clustering (FEC), a conceptually elegant yet surprisingly ad-hoc interpretable neural clustering framework, which views feature extraction as a process of selecting representatives from data and thus automatically captures the underlying data distribution. Given an image, FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with current representatives. Such an iterative working mechanism is implemented in the form of several neural layers and the final representatives can be used for downstream tasks. The cluster assignments across layers, which can be viewed and inspected by humans, make the forward process of FEC fully transparent and empower it with promising ad-hoc interpretability. Extensive experiments on various visual recognition models and tasks verify the effectiveness, generality, and interpretability of FEC. We expect this work will provoke a rethink of the current de facto grid-style paradigm.
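The alternation the abstract describes (group pixels into clusters to abstract representatives, then refresh pixel features with those representatives) can be pictured with a short sketch. The PyTorch snippet below is illustrative only: the layer name NeuralClusteringLayer, the soft EM-style assignment, and the residual feature update are assumptions made for exposition, not the paper's exact FEC design.

    import torch
    import torch.nn as nn

    class NeuralClusteringLayer(nn.Module):
        # Hypothetical sketch of one FEC-style layer. It alternates between
        # (1) softly assigning pixel features to cluster representatives and
        # (2) re-estimating representatives as assignment-weighted pixel means,
        # then updates pixel features using their current representatives.
        def __init__(self, dim, num_clusters, num_iters=3):
            super().__init__()
            # Learnable initial representatives (cluster centers).
            self.centers = nn.Parameter(torch.randn(num_clusters, dim))
            self.proj = nn.Linear(dim, dim)  # feature-update head (assumed)
            self.num_iters = num_iters

        def forward(self, x):
            # x: (batch, num_pixels, dim) -- flattened pixel features
            centers = self.centers.unsqueeze(0).expand(x.size(0), -1, -1)
            for _ in range(self.num_iters):
                # Grouping step: soft cluster assignment for every pixel.
                assign = torch.einsum('bnd,bkd->bnk', x, centers).softmax(dim=-1)
                # Abstraction step: representatives become the
                # assignment-weighted mean of the pixels they cover.
                w = assign / assign.sum(dim=1, keepdim=True).clamp_min(1e-6)
                centers = torch.einsum('bnk,bnd->bkd', w, x)
            # Update pixel features with their representatives (residual form).
            x = x + self.proj(torch.einsum('bnk,bkd->bnd', assign, centers))
            # `assign` is a human-inspectable cluster map; exposing it at every
            # layer is what the abstract calls ad-hoc interpretability.
            return x, centers, assign

Under these assumptions, stacking several such layers and feeding the final representatives to a task head mirrors the high-level pipeline the abstract outlines; visualizing assign layer by layer yields the inspectable cluster assignments the authors emphasize.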
