An Overview of Dataset Distillation
The paper "Dataset Distillation" by Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A. Efros presents a novel framework for compressing large datasets into small, synthetic datasets to facilitate efficient neural network training. This approach, termed as "dataset distillation", diverges from the paradigm of model distillation and instead focuses on encoding the information from an extensive dataset into a minimal set of synthetic images.
Core Methodology
The essence of dataset distillation lies in synthesizing a compact set of artificial training samples such that a model trained on them performs nearly as well as one trained on the complete dataset. The authors achieve this by directly optimizing a small number of synthetic training samples, along with the learning rates used to train on them. Because the network weights obtained after training on the synthetic data are a differentiable function of that data, the synthetic images themselves can be optimized by gradient descent, enabling training with minimal data while keeping performance close to that obtained with the full dataset.
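To make the mechanics concrete, below is a minimal PyTorch-style sketch of a single distillation update for a fixed initialization. It is an illustration under assumptions, not the authors' implementation: the model is a toy linear classifier standing in for LeNet, the tensor shapes and step sizes are arbitrary, and only one inner gradient step is unrolled, whereas the paper can unroll several steps and epochs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Learnable distilled data: 10 synthetic "images" with fixed labels (one per class),
# plus a learnable step size for the inner update (the paper also learns step sizes).
distilled_x = torch.randn(10, 1, 28, 28, requires_grad=True)
distilled_y = torch.arange(10)
lr_inner = torch.tensor(0.02, requires_grad=True)

def model_fn(x, weights):
    # Toy stand-in for LeNet: a linear classifier over flattened pixels.
    w, b = weights
    return x.flatten(1) @ w + b

def distillation_loss(init_weights, real_x, real_y):
    # Inner update: one gradient step on the distilled data, kept differentiable
    # (create_graph=True) so the outer loss can backpropagate into distilled_x.
    inner_loss = F.cross_entropy(model_fn(distilled_x, init_weights), distilled_y)
    grads = torch.autograd.grad(inner_loss, init_weights, create_graph=True)
    updated = [w - lr_inner * g for w, g in zip(init_weights, grads)]
    # Outer objective: loss of the updated weights on a batch of real data.
    return F.cross_entropy(model_fn(real_x, updated), real_y)

# One outer update of the distilled images for a fixed initialization.
init_weights = [(0.01 * torch.randn(784, 10)).requires_grad_(),
                torch.zeros(10, requires_grad=True)]
real_x = torch.randn(64, 1, 28, 28)            # stand-in for a minibatch of real images
real_y = torch.randint(0, 10, (64,))           # stand-in for its labels
loss = distillation_loss(init_weights, real_x, real_y)
loss.backward()                                # fills distilled_x.grad and lr_inner.grad
with torch.no_grad():
    distilled_x -= 0.1 * distilled_x.grad
    lr_inner -= 0.1 * lr_inner.grad
    distilled_x.grad = None
    lr_inner.grad = None
```

In practice the outer update would draw real MNIST minibatches and run for many iterations; the random tensors here only keep the sketch self-contained.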
For instance, the paper demonstrates that the entire 60,000-image MNIST dataset can be encapsulated into just 10 distilled images—one per class—without substantially sacrificing performance. The distilled images achieve 94% test accuracy with a fixed network initialization, whereas the original dataset yields 99% accuracy.
Experimental Validation and Results
The authors conduct extensive experiments on several datasets, including MNIST and CIFAR-10. Across these benchmarks, the data prove highly compressible: models trained on the small synthetic datasets approach the accuracy of models trained on the full datasets. Notable findings include:
- MNIST and CIFAR-10: On MNIST, just 10 distilled images let a LeNet model with a fixed initialization reach 94% test accuracy. On CIFAR-10, 100 distilled images reach 54% test accuracy, again with a fixed initialization.
- Random Initializations: The paper addresses the challenge of synthesizing data that remains effective across different initializations by optimizing the distilled dataset over a distribution of random initial weights, sampling fresh initializations during optimization. This adaptation is crucial for real-world use, where network initializations are not known beforehand; a minimal sketch of this training loop follows this list.
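Continuing the earlier sketch, the random-initialization variant only changes the outer loop: a fresh set of initial weights is sampled at every iteration so the distilled images cannot overfit to a single starting point. The Gaussian initialization, Adam settings, iteration count, and stand-in minibatch below are illustrative assumptions.

```python
# Random-initialization variant: reuses distilled_x, lr_inner, and distillation_loss
# from the sketch above, but re-samples the initial weights each outer iteration.
opt = torch.optim.Adam([distilled_x, lr_inner], lr=0.01)
for step in range(1000):
    init_weights = [(0.01 * torch.randn(784, 10)).requires_grad_(),
                    torch.zeros(10, requires_grad=True)]
    real_x = torch.randn(64, 1, 28, 28)        # stand-in for a real minibatch
    real_y = torch.randint(0, 10, (64,))
    loss = distillation_loss(init_weights, real_x, real_y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```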
Theoretical Insights and Algorithmic Framework
A significant contribution of the paper is the theoretical underpinning it provides for the dataset distillation process. The authors present an analysis for a linear model, establishing a lower bound on the number of synthetic samples required to match the training performance of a full dataset. This analysis elucidates the dependency of distilled samples' effectiveness on the distribution of network initializations—a key consideration for practical deployments.
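As a compact sketch of how the linear analysis proceeds (the notation here is assumed rather than copied from the paper): for linear regression with quadratic loss, one gradient step on distilled data $(\tilde{X}, \tilde{y})$, with normalization constants absorbed into the learned step size $\tilde{\eta}$, gives

$$
\ell(X, y, \theta) = \tfrac{1}{2N}\lVert X\theta - y\rVert^2,
\qquad
\theta_1 = \theta_0 - \tilde{\eta}\,\tilde{X}^{\top}\bigl(\tilde{X}\theta_0 - \tilde{y}\bigr).
$$

Requiring $\ell(X, y, \theta_1)$ to attain the full-data optimum for every initialization $\theta_0$ forces $I - \tilde{\eta}\,\tilde{X}^{\top}\tilde{X}$ to cancel the dependence on $\theta_0$ in the directions spanned by the real data, which constrains the rank of $\tilde{X}^{\top}\tilde{X}$ and hence lower-bounds the number of distilled samples.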
The proposed algorithm iteratively optimizes the synthetic dataset by backpropagating through the gradient update steps, akin to gradient-based hyperparameter optimization methods. This process leverages efficient Hessian-vector products, enabling scalability to more complex networks.
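To see where the Hessian-vector products enter (again with notation assumed): for a single unrolled step $\theta_1 = \theta_0 - \tilde{\eta}\,\nabla_{\theta_0}\ell(\tilde{x}, \theta_0)$, the chain rule gives the gradient of the outer loss with respect to the distilled data as

$$
\nabla_{\tilde{x}}\,\ell(x, \theta_1)
= \Bigl(\tfrac{\partial \theta_1}{\partial \tilde{x}}\Bigr)^{\top}\nabla_{\theta_1}\ell(x, \theta_1)
= -\tilde{\eta}\,\Bigl(\tfrac{\partial^2 \ell(\tilde{x}, \theta_0)}{\partial \theta_0\,\partial \tilde{x}}\Bigr)^{\top}\nabla_{\theta_1}\ell(x, \theta_1),
$$

a mixed second-derivative-times-vector product that reverse-mode automatic differentiation (double backpropagation) evaluates without ever forming the second-derivative matrix explicitly; this is what the create_graph=True step in the sketches above computes.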
Extensions and Applications
Beyond the core dataset distillation setting, the paper explores extensions such as adapting pretrained models with distilled data and malicious dataset poisoning. These applications illustrate the method's versatility:
- Model Adaptation: Distilled datasets can adapt models pretrained on one domain to a new target domain with only a few gradient steps, addressing domain mismatch between training and deployment.
- Dataset Poisoning: The technique can synthesize data to induce targeted misclassification, demonstrating a potential avenue for adversarial attacks and model robustness evaluation.
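As a hedged illustration of the poisoning variant (a sketch of the idea under assumptions, not the paper's exact objective or setup): the inner step is applied to the weights of an already-trained model rather than a random initialization, and the outer loss is swapped for one that rewards a chosen misclassification, here relabeling one source class as an attacker-chosen target class.

```python
def poisoning_loss(trained_weights, real_x, real_y, source_cls=0, target_cls=1):
    # Same differentiable inner step as before, but the distilled data now act as
    # poison applied to an already-trained model (weights must require grad).
    inner_loss = F.cross_entropy(model_fn(distilled_x, trained_weights), distilled_y)
    grads = torch.autograd.grad(inner_loss, trained_weights, create_graph=True)
    updated = [w - lr_inner * g for w, g in zip(trained_weights, grads)]
    # Outer objective: the updated model should predict `target_cls` wherever the
    # true label is `source_cls`, while other examples keep their true labels.
    attacked_y = torch.where(real_y == source_cls,
                             torch.full_like(real_y, target_cls),
                             real_y)
    return F.cross_entropy(model_fn(real_x, updated), attacked_y)
```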
Implications and Future Directions
The dataset distillation strategy proposed in this work has considerable implications for efficient model training, especially in resource-constrained settings. The ability to distill and transport essential learning signals in minimal data formats could transform pre-training and fine-tuning practices, making deep learning more accessible and less dependent on extensive computational resources.
Future research could explore extending this approach to other data types (e.g., audio and text) and further investigating alternative network initialization schemes for improved generalization across varying configurations. Additionally, applications in privacy-preserving learning where data integrity and minimization are paramount could benefit from these distillation techniques.
In summary, dataset distillation emerges as a promising strategy for data-efficient learning, circumventing traditional data constraints and opening new avenues for both research and deployment in machine learning.