- The paper introduces HaBa, a novel factorization technique that decomposes datasets into hallucinators and bases to generate highly informative synthetic data.
- It employs an adversarial contrastive constraint to increase the diversity of the synthesized images; the resulting distilled sets use up to 65% fewer compressed parameters while boosting classification accuracy by about 10%.
- The approach demonstrates strong generalization on datasets like CIFAR-10 and CIFAR-100, highlighting its potential for enhanced storage and computational efficiency.
An Expert Overview of "Dataset Distillation via Factorization"
The paper "Dataset Distillation via Factorization" introduces HaBa, a novel approach to dataset distillation (DD) that enhances data efficiency through a factorization technique. HaBa represents a significant shift from conventional methods by decomposing datasets into hallucinator networks and bases. This structure enables the generation of synthetic data with increased informativeness, leading to enhanced performance in downstream tasks.
Main Contributions
- Hallucinator-Basis Factorization (HaBa): Unlike traditional methods that optimize a fixed set of distilled samples, HaBa factorizes a dataset into hallucinator networks and bases. Because every hallucinator can be paired with every basis, the number of possible synthetic images grows multiplicatively with the two component counts, greatly expanding representation capability (a minimal code sketch follows this list).
- Adversarial Contrastive Constraints: The paper introduces adversarial contrastive constraints to maximize diversity in the generated images, encouraging different hallucinators to produce decorrelated outputs from the same basis.
- Performance and Efficiency Gains: HaBa reduces the number of compressed parameters by up to 65% compared to state-of-the-art methods. On image classification, datasets distilled via HaBa achieve approximately 10% higher accuracy than baseline methods, and the gains carry over to evaluation architectures unseen during distillation, demonstrating strong cross-architecture generalization.
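To make the factorization concrete, here is a minimal PyTorch sketch of the hallucinator-basis decomposition. This is not the authors' implementation: the tiny convolutional hallucinator, the sizes, and the names (`Hallucinator`, `num_bases`, `num_hallucinators`) are illustrative assumptions, and the paper's actual hallucinator architecture is more elaborate.

```python
import torch
import torch.nn as nn

class Hallucinator(nn.Module):
    """Small conv net mapping a basis to a synthetic image.
    Illustrative stand-in, not the paper's exact design."""
    def __init__(self, channels=3, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, channels, 3, padding=1),
            nn.Tanh(),  # keep outputs in a bounded image range
        )

    def forward(self, basis):
        return self.net(basis)

# |B| learnable bases shaped like CIFAR images (3x32x32).
num_bases, num_hallucinators = 10, 5
bases = nn.Parameter(torch.randn(num_bases, 3, 32, 32))
hallucinators = nn.ModuleList(Hallucinator() for _ in range(num_hallucinators))

# Every (hallucinator, basis) pair yields one synthetic sample:
# |H| * |B| images from only |B| image tensors plus |H| small networks.
synthetic = torch.stack([h(bases) for h in hallucinators])
print(synthetic.shape)  # torch.Size([5, 10, 3, 32, 32])
```

Under these toy sizes, the 50 synthetic images cost about 10 x 3,072 basis parameters plus five ~1.8K-parameter networks (roughly 40K values in total) instead of 50 x 3,072 ≈ 154K raw pixels, which illustrates where the compressed-parameter savings come from. In training, both the bases and the hallucinator weights would be optimized jointly against a standard distillation objective, with gradients flowing through the hallucinators into the bases.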
Methodology Details
The core innovation in HaBa lies in treating dataset distillation as a factorization problem, dividing it into two distinct yet interconnected components: hallucinators and bases. Hallucinators learn relationships between samples, and bases store essential dataset information. The combination of these components generates a diverse set of synthetic training samples. The paper further refines this process by imposing adversarial contrastive constraints, enhancing the diversity and discriminative power of the hallucinated samples.
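A hedged sketch of the diversity side of that constraint is below. In the paper, a learned feature extractor plays an adversarial game with the hallucinators; here only a simplified penalty is shown, and the feature choice, shapes, and the name `diversity_loss` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def diversity_loss(feats):
    """Contrastive-style diversity penalty (simplified).

    feats: (|H|, |B|, D) features of images generated from the same
    |B| bases by |H| different hallucinators. Penalizes high cosine
    similarity between different hallucinators' outputs on the same
    basis, pushing hallucinators toward decorrelated images.
    """
    feats = F.normalize(feats, dim=-1)                # unit-norm features
    sim = torch.einsum('hbd,gbd->hgb', feats, feats)  # pairwise cos-sim per basis
    h = feats.size(0)
    off_diag = sim[~torch.eye(h, dtype=torch.bool)]   # drop self-similarity
    return off_diag.mean()

# Toy usage: treat flattened synthetic images as features.
synthetic = torch.randn(5, 10, 3, 32, 32)          # (|H|, |B|, C, H, W)
loss = diversity_loss(synthetic.flatten(2))        # (|H|, |B|, 3072)
```

In the full adversarial setup, the feature extractor would be trained to maximize this same-basis agreement (making matching easy), while the hallucinators and bases minimize it alongside the main distillation loss.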
Experimental Results
Extensive experiments were conducted on multiple datasets, including SVHN, CIFAR-10, and CIFAR-100. The results consistently show that HaBa outperforms existing state-of-the-art methods in both accuracy and data efficiency, with the largest gains appearing in the most aggressive compression regimes (e.g., a budget equivalent to 1 image per class).
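For context, distilled datasets are typically judged by the standard DD protocol: train a fresh network from scratch on the synthetic images and report its accuracy on the real test split. The sketch below assumes the hallucinator/basis objects from the earlier snippet; the helper name `evaluate_distilled` and the stand-in linear classifier (the paper evaluates with convolutional networks) are hypothetical.

```python
import torch
import torch.nn as nn

def evaluate_distilled(hallucinators, bases, labels, test_loader, epochs=300):
    """Train a fresh model on the decoded synthetic set, test on real data."""
    # Decode all |H| * |B| synthetic images; each basis keeps its label.
    images = torch.cat([h(bases) for h in hallucinators]).detach()
    targets = labels.repeat(len(hallucinators))
    model = nn.Sequential(nn.Flatten(), nn.Linear(images[0].numel(), 10))
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):                      # full-batch training, for brevity
        opt.zero_grad()
        loss_fn(model(images), targets).backward()
        opt.step()
    correct = total = 0
    with torch.no_grad():                        # accuracy on the real test split
        for x, y in test_loader:
            correct += (model(x).argmax(1) == y).sum().item()
            total += y.numel()
    return correct / total
```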
Implications and Future Directions
HaBa offers a significant contribution to the dataset distillation domain by dramatically enhancing data efficiency. This can have practical implications for storage and computational efficiency, particularly in edge computing scenarios or environments with bandwidth limitations. Theoretical implications revolve around the potential for further research in dataset factorization and the application of similar methodologies to other areas of artificial intelligence, such as natural language processing.
In future developments, the exploration of class-wise relationships within the factorization framework could be an intriguing direction. Additionally, adapting HaBa to other modalities, such as audio or video, could extend its applicability across domains.
In summary, the HaBa framework advocates a shift from data condensation to data expansion through structured factorization, marking an important evolution in dataset distillation research. This work lays the groundwork for future explorations into efficient data representation and resource management within the AI community.