Soft-Label Dataset Distillation and Text Dataset Distillation: An Expert Overview
The paper "Soft-Label Dataset Distillation and Text Dataset Distillation" authored by Ilia Sucholutsky and Matthias Schonlau from the University of Waterloo presents advancements in the field of dataset distillation. This research introduces methods to enhance dataset distillation, emphasizing the generation of soft labels for synthetic datasets and extending distillation techniques to text and sequential data. This approach not only addresses current limitations in distillation capabilities but also demonstrates improved performance across a variety of tasks.
Key Contributions
- Introduction of Soft Labels: The primary innovation in this paper is the use of soft labels in dataset distillation. Conventional methods assign a 'hard' label to each synthetic sample, restricting it to represent a single class; soft labels instead let each sample carry a learnable distribution over classes. The authors report that this change improves classification accuracy on image datasets by 2-4% over the single-hard-label approach (a minimal sketch of the idea follows this list).
- Reduction in Dataset Size: By employing soft labels, the authors show that fewer synthetic samples are needed to represent the class information, even fewer than one per class. On MNIST, they achieve over 96% accuracy with only 10 synthetic images and nearly 92% with just 5. This reduction has direct implications for storage efficiency and training speed in machine learning models.
- Extension to Text and Sequential Data: The authors extend the distillation process beyond images to text datasets. Sentences are embedded into continuous representations, analogous to images, so the distillation machinery can be applied to natural language tasks. The paper reports that distilled text datasets retain high accuracy with far fewer samples than the original training sets; for example, an IMDB sentiment classifier reaches nearly its original accuracy using only 20 distilled sentences.
- Generalization Across Network Initializations: Sucholutsky and Schonlau also examine how robust the distilled datasets are to the network initialization. Performance is lower with random initializations than with the fixed initialization used during distillation, but the datasets still reach high accuracy, indicating that the distilled samples capture knowledge that generalizes beyond a single set of starting weights.
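The mechanics of learning soft labels can be illustrated with a short sketch. The following is a minimal PyTorch-style sketch, not the authors' implementation: the learner is reduced to a single linear layer trained for one inner gradient step, and all names (e.g. `distill_step`, `soft_cross_entropy`) and hyperparameters are illustrative assumptions. The distilled images, their soft labels, and the inner learning rate are all treated as trainable parameters and updated by backpropagating through the learner's own training step.

```python
import torch
import torch.nn.functional as F

n_distilled, n_classes, dim = 10, 10, 28 * 28

# Learnable synthetic images and soft labels: each label is a logit vector
# that becomes a class distribution, not a single hard class id.
syn_x = torch.randn(n_distilled, dim, requires_grad=True)
syn_y = torch.randn(n_distilled, n_classes, requires_grad=True)
inner_lr = torch.tensor(0.02, requires_grad=True)  # learnable inner step size

def forward(params, x):
    w, b = params
    return x @ w + b  # a linear learner keeps the sketch short

def soft_cross_entropy(logits, target_dist):
    # Cross-entropy against a full distribution rather than a class index.
    return -(target_dist * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

opt = torch.optim.Adam([syn_x, syn_y, inner_lr], lr=1e-2)

def distill_step(real_x, real_y):
    """One outer-loop update of the distilled data, given a batch of real data."""
    # Freshly initialized learner parameters for this outer step.
    w = (0.01 * torch.randn(dim, n_classes)).requires_grad_(True)
    b = torch.zeros(n_classes, requires_grad=True)

    # Inner step: train the learner on the distilled data only.
    inner_loss = soft_cross_entropy(forward((w, b), syn_x),
                                    F.softmax(syn_y, dim=1))
    grads = torch.autograd.grad(inner_loss, (w, b), create_graph=True)
    new_params = (w - inner_lr * grads[0], b - inner_lr * grads[1])

    # Outer step: evaluate the one-step learner on real data and push the
    # resulting gradients back into the synthetic images and soft labels.
    outer_loss = F.cross_entropy(forward(new_params, real_x), real_y)
    opt.zero_grad()
    outer_loss.backward()
    opt.step()
    return outer_loss.item()
```

In the paper the learner is a LeNet-style network trained for several inner steps (and, for robustness, over multiple initializations); the single linear step here only shows how gradients reach the synthetic images and soft labels.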
Experimental Validation
The paper supports its claims with extensive experiments on both image and text datasets. For images, it uses LeNet on MNIST and AlexCifarNet on CIFAR10, showing that soft-label dataset distillation outperforms existing baselines across various settings. For text, convolutional and recurrent networks are applied to the IMDB, SST5, and TREC datasets, further establishing the benefits of the method.
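To make the text extension concrete, here is a companion sketch, again an assumed PyTorch formulation rather than the authors' code: the distilled "sentences" are optimized directly as continuous embedding matrices, so no discrete tokens ever have to be recovered, and the inner/outer loop mirrors the image sketch above. The tiny one-dimensional convolutional learner and all sizes are illustrative placeholders for the convolutional and recurrent text models used in the paper.

```python
import torch
import torch.nn.functional as F

n_distilled, seq_len, emb_dim, n_classes = 20, 40, 50, 2

# Distilled "sentences" live directly in the continuous embedding space
# (one seq_len x emb_dim matrix each), alongside learnable soft labels.
syn_emb = torch.randn(n_distilled, seq_len, emb_dim, requires_grad=True)
syn_y = torch.randn(n_distilled, n_classes, requires_grad=True)

def text_cnn(params, emb):
    """1-D convolution over the sequence, global max-pooling, linear head."""
    conv_w, conv_b, fc_w, fc_b = params
    h = F.relu(F.conv1d(emb.transpose(1, 2), conv_w, conv_b))  # (batch, filters, L)
    h = h.max(dim=2).values                                    # global max-pool
    return h @ fc_w + fc_b

def init_params(n_filters=16, kernel=5):
    # A fresh, randomly initialized learner for each outer step.
    conv_w = (0.1 * torch.randn(n_filters, emb_dim, kernel)).requires_grad_(True)
    conv_b = torch.zeros(n_filters, requires_grad=True)
    fc_w = (0.1 * torch.randn(n_filters, n_classes)).requires_grad_(True)
    fc_b = torch.zeros(n_classes, requires_grad=True)
    return conv_w, conv_b, fc_w, fc_b

opt = torch.optim.Adam([syn_emb, syn_y], lr=1e-2)
inner_lr = 0.05

def distill_step(real_emb, real_y):
    """real_emb: a batch of real sentences already mapped through a fixed embedding layer."""
    params = init_params()
    # Inner step: fit the fresh learner to the distilled embeddings and soft labels.
    inner_loss = -(F.softmax(syn_y, dim=1) *
                   F.log_softmax(text_cnn(params, syn_emb), dim=1)).sum(dim=1).mean()
    grads = torch.autograd.grad(inner_loss, params, create_graph=True)
    new_params = tuple(p - inner_lr * g for p, g in zip(params, grads))
    # Outer step: judge that learner on real embedded sentences.
    outer_loss = F.cross_entropy(text_cnn(new_params, real_emb), real_y)
    opt.zero_grad()
    outer_loss.backward()
    opt.step()
    return outer_loss.item()
```

Evaluating a distilled set then simply means training a fresh network on the handful of synthetic embedding matrices and measuring accuracy on the real test set.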
Implications and Future Directions
The proposed soft-label dataset distillation substantially improves training efficiency by drastically reducing the amount of training data while retaining most of the original accuracy. This makes it attractive for deployment environments where computational resources and storage are constrained. Additionally, extending dataset distillation to text data opens new avenues for efficient NLP model training.
From a theoretical perspective, this work prompts further inquiry into the nature of label distributions and their impact on network generalization. Future research could explore distillation across varying neural architectures, as well as further optimization of soft-label initialization and of embedding techniques for text data.
In conclusion, Sucholutsky and Schonlau's work represents a meaningful advancement in dataset distillation, providing a robust framework for improving the efficiency of machine learning models while maintaining high performance across various data modalities.