- The paper introduces a method that generates synthetic datasets by matching feature distributions across embedding spaces, eliminating costly bi-level optimization.
- It achieves a 45x reduction in synthesis time on CIFAR10 and retains 98.6% of baseline accuracy on MNIST using only 0.83% of the training data.
- The approach scales to large datasets and diverse neural architectures, enabling efficient continual learning and neural architecture search.
Dataset Condensation with Distribution Matching
The paper "Dataset Condensation with Distribution Matching" presents a method to address the computational inefficiencies associated with training state-of-the-art deep learning models on large datasets. The proposed solution focuses on dataset condensation, aiming to create a smaller synthetic dataset that encapsulates the essential information of the original, much larger dataset. Such condensed datasets allow for faster model training while preserving the performance levels achieved with the full data.
Key Contributions
- Optimization Approach: The method generates synthetic data by matching the feature distributions of synthetic and real samples across a family of embedding spaces given by randomly initialized networks (see the sketch after this list). This removes the computationally expensive bi-level optimization and second-order derivative calculations required by previous condensation methods.
- Efficiency Gains: Synthetic data generation is markedly cheaper than in prior methods; for example, the paper reports a 45x reduction in synthesis time when producing a 500-image condensed CIFAR10 set compared with the state-of-the-art baseline.
- Practical Applications: The method's low computational burden allows it to scale to larger, more realistic datasets such as TinyImageNet and ImageNet-1K, and makes it well suited to use cases like continual learning and neural architecture search, where computational resources are a critical constraint.
- Model Robustness: The technique delivers consistent performance across different neural network architectures and synthetic-set sizes, indicating robustness to the choice of downstream model and to the size of the condensed dataset.
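As referenced above, the following sketch illustrates the distribution-matching idea under simplifying assumptions: the synthetic images are learnable tensors, and at each step a freshly initialized network embeds both real and synthetic samples, whose per-class mean features are pulled together. The ConvNet embedder, batch sizes, learning rate, and the `sample_real_batch` placeholder are assumptions for illustration, not the authors' released implementation.

```python
# Sketch of the distribution-matching idea: optimize synthetic images so their
# mean features match those of real images under randomly initialized
# embedding networks (no inner-loop training of any model).
import torch
import torch.nn as nn

def make_embedder() -> nn.Module:
    # A freshly initialized (untrained) feature extractor; re-sampled each step
    # so the synthetic data must match the real data across many random embeddings.
    return nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
        nn.AvgPool2d(2),
        nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

def sample_real_batch(c: int) -> torch.Tensor:
    # Placeholder: in practice, draw a batch of real images of class c.
    return torch.randn(256, 3, 32, 32)

num_classes, images_per_class = 10, 50              # e.g. a 500-image CIFAR10 set
syn_images = torch.randn(num_classes, images_per_class, 3, 32, 32,
                         requires_grad=True)        # learnable synthetic images
optimizer = torch.optim.SGD([syn_images], lr=1.0)

for step in range(1000):
    embedder = make_embedder()                      # fresh random embedding space
    loss = torch.zeros(())
    for c in range(num_classes):
        real_mean = embedder(sample_real_batch(c)).mean(dim=0)
        syn_mean = embedder(syn_images[c]).mean(dim=0)
        loss = loss + ((real_mean - syn_mean) ** 2).sum()  # match class-wise means
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```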
Strong Numerical Results
- The method achieves competitive or superior performance compared to state-of-the-art dataset condensation techniques while maintaining computational efficiency.
- Specifically, the synthesized datasets achieve 98.6% of baseline performance on MNIST using only 0.83% of the training data, and recover 60% of baseline performance on TinyImageNet using only 10% of the original dataset size.
Implications and Future Directions
The practical implications of this research are significant: by reducing dependence on massive computational resources, the approach helps democratize deep learning and opens pathways for further research into efficient training regimes that do not compromise model accuracy. It could also encourage more sustainable AI practice by lowering the energy and financial costs of training large models.
Potential future research directions could explore:
- Extending the technique to other domains such as natural language processing, where dataset sizes are also increasing rapidly.
- Refining the approach for specific applications such as reinforcement learning, where rapid training cycles can benefit real-time decision-making.
- Further investigating the generative aspects of synthetic dataset creation to enhance data diversity and further improve training efficiency.
In conclusion, this paper advances dataset condensation with a method that effectively balances training efficiency and performance preservation, supporting more scalable and sustainable training of AI models.