Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach (2405.15613v2)

Published 24 May 2024 in cs.LG, cs.AI, and cs.CV

Abstract: Self-supervised features are the cornerstone of modern machine learning systems. They are typically pre-trained on data collections whose construction and curation typically require extensive human effort. This manual process has some limitations similar to those encountered in supervised learning, e.g., the crowd-sourced selection of data is costly and time-consuming, preventing scaling the dataset size. In this work, we consider the problem of automatic curation of high-quality datasets for self-supervised pre-training. We posit that such datasets should be large, diverse and balanced, and propose a clustering-based approach for building ones satisfying all these criteria. Our method involves successive and hierarchical applications of $k$-means on a large and diverse data repository to obtain clusters that distribute uniformly among data concepts, followed by a hierarchical, balanced sampling step from these clusters. Extensive experiments on three different data domains including web-based images, satellite images and text show that features trained on our automatically curated datasets outperform those trained on uncurated data while being on par or better than ones trained on manually curated data. Code is available at https://github.com/facebookresearch/ssl-data-curation.

Summary

  • The paper presents a clustering-based method that utilizes hierarchical k-means with resampling to create balanced datasets for self-supervised learning.
  • It demonstrates improved performance with a top-1 ImageNet accuracy of 84.7% and enhanced robustness metrics compared to uncurated data.
  • The approach provides a scalable, automated curation pipeline, demonstrated on web images, text corpora, and satellite imagery.

A Clustering-Based Approach for Automatic Data Curation in Self-Supervised Learning

The paper addresses the problem of automatic data curation for self-supervised learning (SSL) with a clustering-based methodology. Traditional curation practices, whether crowd-sourced or manual, are expensive and time-consuming, which hinders scaling. In contrast, the proposed approach is an automatic, principled curation technique aimed at constructing large, diverse, and balanced datasets that improve SSL performance.

Core Propositions

The paper begins by establishing that effective SSL pre-training datasets need to be large, diverse, and balanced. The subpar performance of SSL models trained on uncurated data is linked to imbalanced data distributions arising from the long-tailed nature of concepts in such collections. Hierarchical application of k-means clustering is therefore proposed to address this imbalance by producing clusters that distribute more uniformly across data concepts.

Clustering Approach and Experimental Validations

The authors introduce a hierarchical k-means clustering method combined with resampling-clustering steps. This approach distributes data points more uniformly across clusters, mitigating the tendency of plain k-means to devote many clusters to dominant concepts while under-representing rare ones.
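
To make the resampling-clustering idea concrete, the sketch below shows one such step; it is an illustration under stated assumptions (NumPy embeddings, scikit-learn k-means, an arbitrary per-cluster budget), not the authors' released implementation.

```python
# Minimal sketch of one resampling-clustering step (not the authors' code):
# re-fit k-means on a subset drawn roughly uniformly across the current
# clusters, so dominant concepts contribute fewer points to the next round.
# `k`, `per_cluster`, and the use of scikit-learn are illustrative choices.
import numpy as np
from sklearn.cluster import KMeans

def resample_then_cluster(embeddings, labels, k, per_cluster=100, seed=0):
    rng = np.random.default_rng(seed)
    subset = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        take = min(per_cluster, len(members))
        subset.append(rng.choice(members, size=take, replace=False))
    subset = np.concatenate(subset)
    # Cluster the balanced subset, then assign every point to the new centroids.
    km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(embeddings[subset])
    return km.cluster_centers_, km.predict(embeddings)
```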

Key performance metrics point to the superiority of their approach:

  • Web-based Images: SSL features trained on their automatically curated datasets exhibit improved performance compared to those trained on uncurated data. The gains are especially pronounced in robustness, out-of-distribution generalization, and long-tailed cases.
  • Numerical Improvements: For instance, their curated datasets achieve a top-1 accuracy of 84.7% on ImageNet validation compared to 82.8% from uncurated data. Robustness metrics improve significantly, e.g., from 14.3% mAP on Oxford Hard retrieval (uncurated) to 32.1% (curated).

Methodological Insights

Hierarchical k-means and Resampling

The hierarchical k-means technique mitigates the effect of skewed concept distributions by:

  1. Building multiple levels of clustering.
  2. Applying k-means successively to the centroids obtained at the previous level.
  3. Interleaving resampling steps that pull the centroid distribution towards uniformity over concepts.

Their experiments empirically verify the effectiveness of hierarchical k-means over baseline (flat) k-means: multi-level hierarchies yield more balanced data distributions, as illustrated by experiments on clustered ImageNet classes.
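
A minimal sketch of the successive clustering of centroids described above is given here; the level sizes and the use of scikit-learn are illustrative assumptions, and the resampling refinement applied at each level in the paper is only noted in a comment.

```python
# Minimal sketch of hierarchical k-means: cluster the embeddings, then
# repeatedly cluster the centroids of the previous level. Level sizes are
# illustrative; the paper additionally interleaves resampling-clustering
# steps at each level to push clusters towards uniformity over concepts.
from sklearn.cluster import KMeans

def hierarchical_kmeans(embeddings, level_sizes=(10_000, 1_000, 100), seed=0):
    points, fitted = embeddings, []
    for k in level_sizes:
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(points)
        fitted.append(km)
        points = km.cluster_centers_  # the next level clusters these centroids
    return fitted  # one fitted KMeans per level, from finest to coarsest
```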

Sampling Techniques

Flat and hierarchical sampling strategies within clusters are compared. Hierarchical sampling, especially with the “random” within-cluster strategy (denoted 4r), gives the best results: it enforces balance not only at the top level of the hierarchy but across all levels.
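
The following sketch illustrates the hierarchical balanced-sampling idea; the data layout (one parent-assignment array per level) and the even budget split are assumptions made for illustration rather than the paper's exact procedure.

```python
# Minimal sketch of hierarchical ("random") sampling: split the sampling
# budget evenly across clusters at every level of the hierarchy, so balance
# holds at all levels, and draw points uniformly within the lowest level.
# The data layout and the even split are assumptions for illustration.
import numpy as np

def hierarchical_sample(assignments, n_target, seed=0):
    # assignments[l][i] = cluster (at level l+1) of item i at level l;
    # level-0 items are the raw data points.
    rng = np.random.default_rng(seed)

    def draw(level, cluster_id, budget):
        members = np.flatnonzero(assignments[level] == cluster_id)
        if level == 0:  # members are raw data indices: sample uniformly
            take = min(budget, len(members))
            return list(rng.choice(members, size=take, replace=False))
        picked = []
        for child in members:  # recurse with an (approximately) even budget
            picked += draw(level - 1, child, max(1, budget // len(members)))
        return picked[:budget]

    n_top = int(assignments[-1].max()) + 1
    per_top = max(1, n_target // n_top)
    return [i for c in range(n_top) for i in draw(len(assignments) - 1, c, per_top)]
```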

Implications and Applications

Practical Applications

The paper’s implications extend across various data types:

  • Web-Based Images: Enhancing model robustness and generalization using balanced datasets.
  • Text Corpora: Significant performance improvements for LLMs trained on curated text data.
  • Satellite Imagery: Better canopy height estimations from SSL models trained on curated satellite images.

Future Directions

The authors note that hierarchical k-means clustering could be adopted more broadly beyond SSL, with active learning and data pruning as natural next steps. However, the method's reliance on features pre-trained on manually curated datasets such as ImageNet remains a limitation; future work should aim to remove this dependency.

Conclusion

Overall, this paper presents a well-founded methodology for automatic data curation in SSL. Rigorous experimental validation and consistent performance improvements underscore the method's efficacy. The approach paves the way for more scalable and automated curation pipelines, potentially reshaping how SSL datasets are constructed and offering a path to larger, more balanced, and more diverse data repositories.
