Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach (2405.15613v2)

Published 24 May 2024 in cs.LG, cs.AI, and cs.CV

Abstract: Self-supervised features are the cornerstone of modern machine learning systems. They are typically pre-trained on data collections whose construction and curation typically require extensive human effort. This manual process has some limitations similar to those encountered in supervised learning, e.g., the crowd-sourced selection of data is costly and time-consuming, preventing scaling the dataset size. In this work, we consider the problem of automatic curation of high-quality datasets for self-supervised pre-training. We posit that such datasets should be large, diverse and balanced, and propose a clustering-based approach for building ones satisfying all these criteria. Our method involves successive and hierarchical applications of $k$-means on a large and diverse data repository to obtain clusters that distribute uniformly among data concepts, followed by a hierarchical, balanced sampling step from these clusters. Extensive experiments on three different data domains including web-based images, satellite images and text show that features trained on our automatically curated datasets outperform those trained on uncurated data while being on par or better than ones trained on manually curated data. Code is available at https://github.com/facebookresearch/ssl-data-curation.

Summary

  • The paper presents a clustering-based method that utilizes hierarchical k-means with resampling to create balanced datasets for self-supervised learning.
  • It demonstrates improved performance with a top-1 ImageNet accuracy of 84.7% and enhanced robustness metrics compared to uncurated data.
  • The approach provides a scalable, automated curation pipeline, demonstrated on web images, text corpora, and satellite imagery.

A Clustering-Based Approach for Automatic Data Curation in Self-Supervised Learning

The paper addresses the problem of automatic data curation for self-supervised learning (SSL) with a clustering-based methodology. Traditional curation practices, whether crowd-sourced or manual, are expensive and time-consuming, which hinders scaling. In contrast, the proposed approach is an automatic, principled curation technique aimed at constructing large, diverse, and balanced datasets that improve SSL performance.

Core Propositions

The paper begins by establishing that effective SSL pre-training datasets need to be large, diverse, and balanced. The subpar performance of SSL models trained on uncurated data is linked to imbalanced data distributions arising from the long-tailed nature of concepts in such collections. Hierarchical application of k-means clustering is therefore proposed to address this imbalance by producing clusters that distribute more uniformly across data concepts.

Clustering Approach and Experimental Validations

The authors introduce a hierarchical k-means clustering method combined with resampling-clustering steps. This approach distributes data points more uniformly across clusters, mitigating the tendency of plain k-means to devote many clusters to dominant concepts while under-representing rare ones.
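
To make the resampling-clustering idea concrete, the sketch below shows one such step; it is an illustration under stated assumptions (NumPy embeddings, scikit-learn k-means, an arbitrary per-cluster budget), not the authors' released implementation.

```python
# Minimal sketch of one resampling-clustering step (not the authors' code):
# re-fit k-means on a subset drawn roughly uniformly across the current
# clusters, so dominant concepts contribute fewer points to the next round.
# `k`, `per_cluster`, and the use of scikit-learn are illustrative choices.
import numpy as np
from sklearn.cluster import KMeans

def resample_then_cluster(embeddings, labels, k, per_cluster=100, seed=0):
    rng = np.random.default_rng(seed)
    subset = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        take = min(per_cluster, len(members))
        subset.append(rng.choice(members, size=take, replace=False))
    subset = np.concatenate(subset)
    # Cluster the balanced subset, then assign every point to the new centroids.
    km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(embeddings[subset])
    return km.cluster_centers_, km.predict(embeddings)
```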

Key performance metrics point to the superiority of their approach:

  • Web-based Images: SSL features trained on their automatically curated datasets exhibit improved performance compared to those trained on uncurated data. The gains are especially pronounced in robustness, out-of-distribution generalization, and long-tailed cases.
  • Numerical Improvements: For instance, their curated datasets achieve a top-1 accuracy of 84.7% on ImageNet validation compared to 82.8% from uncurated data. Robustness metrics improve significantly, e.g., from 14.3% mAP on Oxford Hard retrieval (uncurated) to 32.1% (curated).

Methodological Insights

Hierarchical k-means and Resampling

The hierarchical k-means technique mitigates the effect of skewed concept distributions by:

  1. Building multiple levels of clustering.
  2. Applying k-means successively to the centroids obtained at the previous level.
  3. Interleaving resampling steps that pull the centroid distribution towards uniformity over concepts.

Their experiments empirically verify the effectiveness of hierarchical k-means over baseline (flat) k-means: multi-level hierarchies yield more balanced data distributions, as illustrated by experiments on clustered ImageNet classes.
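
A minimal sketch of the successive clustering of centroids described above is given here; the level sizes and the use of scikit-learn are illustrative assumptions, and the resampling refinement applied at each level in the paper is only noted in a comment.

```python
# Minimal sketch of hierarchical k-means: cluster the embeddings, then
# repeatedly cluster the centroids of the previous level. Level sizes are
# illustrative; the paper additionally interleaves resampling-clustering
# steps at each level to push clusters towards uniformity over concepts.
from sklearn.cluster import KMeans

def hierarchical_kmeans(embeddings, level_sizes=(10_000, 1_000, 100), seed=0):
    points, fitted = embeddings, []
    for k in level_sizes:
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(points)
        fitted.append(km)
        points = km.cluster_centers_  # the next level clusters these centroids
    return fitted  # one fitted KMeans per level, from finest to coarsest
```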

Sampling Techniques

Flat and hierarchical sampling strategies within clusters are compared. Hierarchical sampling, especially with the “random” within-cluster strategy (denoted 4r), gives the best results: it enforces balance not only at the top level of the hierarchy but across all levels.
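
The following sketch illustrates the hierarchical balanced-sampling idea; the data layout (one parent-assignment array per level) and the even budget split are assumptions made for illustration rather than the paper's exact procedure.

```python
# Minimal sketch of hierarchical ("random") sampling: split the sampling
# budget evenly across clusters at every level of the hierarchy, so balance
# holds at all levels, and draw points uniformly within the lowest level.
# The data layout and the even split are assumptions for illustration.
import numpy as np

def hierarchical_sample(assignments, n_target, seed=0):
    # assignments[l][i] = cluster (at level l+1) of item i at level l;
    # level-0 items are the raw data points.
    rng = np.random.default_rng(seed)

    def draw(level, cluster_id, budget):
        members = np.flatnonzero(assignments[level] == cluster_id)
        if level == 0:  # members are raw data indices: sample uniformly
            take = min(budget, len(members))
            return list(rng.choice(members, size=take, replace=False))
        picked = []
        for child in members:  # recurse with an (approximately) even budget
            picked += draw(level - 1, child, max(1, budget // len(members)))
        return picked[:budget]

    n_top = int(assignments[-1].max()) + 1
    per_top = max(1, n_target // n_top)
    return [i for c in range(n_top) for i in draw(len(assignments) - 1, c, per_top)]
```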

Implications and Applications

Practical Applications

The paper’s implications extend across various data types:

  • Web-Based Images: Enhancing model robustness and generalization using balanced datasets.
  • Text Corpora: Significant performance improvements for LLMs trained on curated text data.
  • Satellite Imagery: Better canopy height estimations from SSL models trained on curated satellite images.

Future Directions

The authors note that hierarchical k-means clustering could be adopted more broadly beyond SSL, with active learning and data pruning as natural next steps. However, the method's reliance on features pre-trained on manually curated datasets such as ImageNet remains a limitation; future work should aim to remove this dependency.

Conclusion

Overall, this paper presents a well-founded methodology for automatic data curation in SSL. Rigorous experimental validation and consistent performance improvements underscore the method's efficacy. The approach paves the way for more scalable and automated curation pipelines, potentially reshaping how SSL datasets are constructed and offering a path to larger, more balanced, and more diverse data repositories.
