SemDeDup: Data-Efficient Learning through Semantic Deduplication
This paper investigates how semantic deduplication can improve data efficiency in machine learning. It introduces SemDeDup, a method for identifying and removing semantic duplicates from large-scale datasets. The approach leverages embeddings from pre-trained models to efficiently find pairs of examples that carry similar semantic content despite being perceptually distinct.
Conceptual Foundation
Prominent models like CLIP are typically trained on massive web-scraped datasets; the additional data improves performance, but much of it is redundant, which inflates computational cost. The paper distinguishes several classes of redundant data: perceptual duplicates (near-identical copies of the same example, such as the same image at two resolutions), semantic duplicates (examples that convey the same information despite differing at the pixel or token level), and semantically redundant data (examples that are highly similar in meaning without being duplicates). It tackles semantic duplication by working in the embedding space of a pre-trained model, leading to more data-efficient training paradigms.
Methodology and Implementation
- Semantic Deduplication Framework: SemDeDup starts by embedding data points using a pre-trained model to exploit the semantically meaningful geometric properties of the embedding space.
- Efficiency in Computation: To keep deduplication tractable, the method first partitions the embeddings into clusters with k-means and only searches for duplicates within each cluster, reducing the cost of pairwise comparison from O(n²) to roughly O(n²/k), where n is the number of data points and k is the number of clusters (a code sketch follows this list).
- Empirical Results: The paper reports substantial semantic redundancy in the LAION-440M dataset, demonstrating that up to 50% of the data can be removed with minimal impact on performance; removing this redundancy largely preserves, and in some cases slightly improves, accuracy while also speeding up training.
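To make the procedure above concrete, here is a minimal sketch of the pipeline, assuming the embeddings (e.g. from a CLIP image encoder) have already been computed as a NumPy array. The function name `semdedup`, the cluster count, and the similarity threshold are illustrative choices rather than the paper's exact settings; in particular, the paper keeps the example with the lowest similarity to the cluster centroid within each duplicate group, whereas this sketch simply keeps the first example it encounters.

```python
import numpy as np
from sklearn.cluster import KMeans


def semdedup(embeddings: np.ndarray, n_clusters: int = 100,
             threshold: float = 0.95) -> np.ndarray:
    """Return indices of the examples to keep after semantic deduplication.

    embeddings: (n, d) array from a pre-trained encoder.
    threshold:  cosine similarity above which a pair counts as a
                semantic duplicate (a tunable hyperparameter).
    """
    # Normalise rows so that dot products equal cosine similarities.
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    # k-means restricts the duplicate search to within-cluster pairs,
    # cutting the pairwise-comparison cost from O(n^2) to roughly O(n^2 / k).
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)

    keep = np.ones(len(embeddings), dtype=bool)
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) < 2:
            continue
        sims = embeddings[idx] @ embeddings[idx].T  # within-cluster cosine sims
        # Mark every pair above the threshold and drop the second member,
        # so one representative of each duplicate group survives.
        dup_rows, dup_cols = np.where(np.triu(sims, k=1) > threshold)
        keep[idx[dup_cols]] = False
    return np.where(keep)[0]
```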
Evaluation and Outcomes
The empirical evaluation across multiple datasets demonstrates the efficacy of SemDeDup. When applied to the LAION-440M dataset, zero-shot evaluation showed that performance is essentially preserved after semantic deduplication: reducing the dataset volume by 50% cost less than a 0.5% drop in top-1 accuracy. SemDeDup also improved training convergence, reaching a given performance level with less compute and thereby improving the cost-to-performance trade-off.
- Out-of-Distribution Robustness: Robustness to out-of-distribution data also benefited, with models trained on semantically deduplicated data achieving higher average accuracies on out-of-distribution evaluations than models trained on the full dataset.
- Dataset Variability: Applying SemDeDup to language data such as C4 yielded more moderate but still meaningful efficiency gains, reducing the training computation needed to reach baseline performance by up to 15%.
Theoretical and Practical Implications
The practical implications of reducing dataset size while maintaining or improving model performance are substantial. Beyond the immediate computational savings, these results point toward more democratized access to training on vast datasets, previously restricted to groups with large compute budgets. They also suggest that while scaling laws predict performance gains from increased data quantity, strategic data-quality interventions like SemDeDup can accelerate, or even exceed, those gains by removing redundancy.
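To illustrate the scaling-law argument, the following uses the generic power-law form common in the scaling-law literature; it is not an equation from the paper, and the symbols (N for dataset size, p for the duplicated fraction) are introduced here purely for exposition.

```latex
% Illustrative power-law scaling (generic form, not from the SemDeDup paper):
% test error L as a function of the number of training examples N.
\[
  L(N) \;\approx\; a\,N^{-\alpha} + c
\]
% If a fraction p of the examples are semantic duplicates that add little
% new information, the thesis is that the deduplicated subset of size
% (1 - p)N can approximately match the full-data loss,
\[
  L\bigl((1 - p)\,N\bigr) \;\approx\; L(N),
\]
% i.e. the same performance is reached with a fraction (1 - p) of the data
% and of the training compute.
```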
Future Directions
The results outlined pave the way for further exploration of unified strategies for optimizing datasets across different learning paradigms. Areas for expansion include:
- Combinatorial Deduplication: Exploring combinations with existing deduplication strategies.
- Error Analysis of Deduplication: Understanding when the gains from semantic deduplication align with, and when they diverge from, the power-law gains predicted by traditional dataset scaling.
- Optimizing Deduplication Algorithms: Tuning the semantic-similarity threshold to control how aggressively duplicates are removed across diverse dataset classes (a simple sweep is sketched below).
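For instance, one straightforward way to study the threshold is to sweep it and record how much data each setting removes. The sketch below reuses the hypothetical `semdedup` function from the Methodology section; the embedding file name and the threshold values are assumptions for illustration, not settings from the paper.

```python
import numpy as np

# Illustrative threshold sweep: report how much data each cosine-similarity
# threshold would remove, reusing the `semdedup` sketch defined earlier.
embeddings = np.load("clip_embeddings.npy")  # assumed (n, d) embedding matrix
for threshold in (0.99, 0.97, 0.95, 0.90):
    kept = semdedup(embeddings, n_clusters=100, threshold=threshold)
    removed = 1 - len(kept) / len(embeddings)
    print(f"threshold={threshold:.2f}  fraction removed={removed:.1%}")
```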
In conclusion, by revealing the hidden redundancy in large web-derived datasets, SemDeDup provides a straightforward yet effective algorithm for significantly improving the training efficiency of large models. It illustrates that better dataset curation could become a cornerstone of scalable and sustainable AI development.