
Dataset Distillation via Curriculum Data Synthesis in Large Data Era (2311.18838v2)

Published 30 Nov 2023 in cs.CV, cs.AI, and cs.LG

Abstract: Dataset distillation or condensation aims to generate a smaller but representative subset from a large dataset, which allows a model to be trained more efficiently, meanwhile evaluating on the original testing data distribution to achieve decent performance. Previous decoupled methods like SRe$2$L simply use a unified gradient update scheme for synthesizing data from Gaussian noise, while, we notice that the initial several update iterations will determine the final outline of synthesis, thus an improper gradient update strategy may dramatically affect the final generation quality. To address this, we introduce a simple yet effective global-to-local gradient refinement approach enabled by curriculum data augmentation ($\texttt{CDA}$) during data synthesis. The proposed framework achieves the current published highest accuracy on both large-scale ImageNet-1K and 21K with 63.2% under IPC (Images Per Class) 50 and 36.1% under IPC 20, using a regular input resolution of 224$\times$224 with faster convergence speed and less synthetic time. The proposed model outperforms the current state-of-the-art methods like SRe$2$L, TESLA, and MTT by more than 4% Top-1 accuracy on ImageNet-1K/21K and for the first time, reduces the gap to its full-data training counterparts to less than absolute 15%. Moreover, this work represents the inaugural success in dataset distillation on the larger-scale ImageNet-21K dataset under the standard 224$\times$224 resolution. Our code and distilled ImageNet-21K dataset of 20 IPC, 2K recovery budget are available at https://github.com/VILA-Lab/SRe2L/tree/main/CDA.

Authors (2)
  1. Zeyuan Yin (7 papers)
  2. Zhiqiang Shen (172 papers)
Citations (7)

Summary

Dataset Distillation in the Era of Large-Scale Datasets

This paper addresses the challenge of dataset distillation at large data scales, specifically targeting datasets like ImageNet-1K and ImageNet-21K. Dataset distillation, which generates compact, representative subsets of large datasets, is particularly relevant now that extensive datasets impose significant computational and storage demands. This work introduces a strategy termed Curriculum Data Augmentation (CDA), which advances prior methodologies both conceptually and in measured performance.

Key Contributions and Methodology

The paper makes several contributions, primarily in the sphere of enhancing dataset distillation methods to handle large-scale datasets efficiently. This is achieved through the following innovations:

  1. Curriculum Data Augmentation (CDA): A core contribution is the introduction of CDA. The premise behind CDA builds on the philosophy of curriculum learning, in which the learning process benefits from gradually increasing complexity. Here, CDA manages the difficulty of data synthesis by adjusting how training samples are cropped, incrementally exposing the model to more complex, localized portions of the data.
  2. Integration of Curriculum and Reverse Curriculum Learning: The paper compares different paradigms of data synthesis: standard curriculum learning, reverse curriculum learning, and a constant baseline. These paradigms are operationalized by parameterizing the image crops used during synthesis, in particular the scale range of RandomResizedCrop (a minimal sketch of this scheduling is given after the list).
  3. Empirical Evaluation: The authors empirically validate their approach on CIFAR-100, Tiny-ImageNet, ImageNet-1K, and, notably, ImageNet-21K. CDA's application on large-scale datasets like ImageNet-21K marks a pioneering effort in this domain. In these evaluations, CDA consistently outperforms existing state-of-the-art methods, demonstrating improvements of more than 4% Top-1 accuracy on ImageNet-1K when compared against prominent baseline methods.
  4. Theoretical and Practical Implications: By distilling datasets to a significant degree of compactness while retaining robust classification accuracy, the paper highlights practical implications for leveraging large-scale datasets in resource-constrained environments. Moreover, the synthesized data may raise fewer privacy concerns, since it potentially excludes raw, personally identifiable samples.
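
The following is a minimal sketch of how such a global-to-local crop curriculum could look, assuming the curriculum is realized by scheduling the lower bound of RandomResizedCrop's scale range from near-full-image (global) crops toward small (local) crops as synthesis iterations progress. The helper names (`min_scale_schedule`, `cda_crop`) and the linear schedule are illustrative and not taken from the paper's released code.

```python
# Sketch of curriculum data augmentation (CDA) scheduling for data synthesis.
# Assumption: the curriculum is expressed through the lower bound of the
# RandomResizedCrop scale range, moving from global to local crops over time.

import torch
from torchvision import transforms


def min_scale_schedule(step: int, total_steps: int,
                       start: float = 1.0, end: float = 0.08) -> float:
    """Linearly decay the minimum crop scale from `start` (global) to `end` (local)."""
    progress = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * progress


def cda_crop(synthetic_batch: torch.Tensor, step: int, total_steps: int,
             out_size: int = 224) -> torch.Tensor:
    """Apply a curriculum-scheduled RandomResizedCrop to a batch of synthetic images."""
    lo = min_scale_schedule(step, total_steps)
    crop = transforms.RandomResizedCrop(out_size, scale=(lo, 1.0))
    # Crop each image independently so every sample sees a different region.
    return torch.stack([crop(img) for img in synthetic_batch])


# Schematic use inside a synthesis loop (objective is a placeholder):
# for step in range(total_steps):
#     views = cda_crop(synthetic_images, step, total_steps)
#     loss = matching_loss(teacher(views), target_statistics)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```

Under this framing, the reverse-curriculum and constant-crop baselines compared in the paper would correspond to reversing or fixing the schedule, respectively.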

Numerical Results and Achievements

The paper reports strong numerical results. For ImageNet-1K under IPC 50, the proposed method achieves 63.2% Top-1 accuracy, an improvement of more than 4% over previous approaches. On ImageNet-21K, the method reaches 36.1% Top-1 accuracy under IPC 20, narrowing the gap to its full-data training counterpart to less than an absolute 15%.

Practical and Theoretical Implications

Practically, this research could democratize access to powerful machine learning models by reducing the computational resources required for training, especially in contexts with limited data storage or processing capabilities. Theoretically, this investigation opens avenues for further research into synthesis strategies and curriculum-based learning schedules that might enhance generalization and reduce overfitting further in distilled datasets.

Future Outlook in AI

The promising results from CDA point toward a future in which vast datasets can be handled efficiently. This can extend beyond image datasets to text, audio, and other modalities, potentially transforming data management and training paradigms across AI subfields. Future investigations might explore more sophisticated curriculum strategies or adaptive data augmentation techniques to further optimize these results, enhancing both the applicability and efficiency of dataset distillation methodologies.

In summary, this paper leverages curriculum learning in a novel way to synthesize representative subsets of large datasets efficiently. Its findings suggest exciting possibilities for AI model training, particularly in resource-limited settings, paving the way for more inclusive and widespread AI application development.