Scaling Up Dataset Distillation to ImageNet-1K with Constant Memory (2211.10586v4)

Published 19 Nov 2022 in cs.CV and cs.AI

Abstract: Dataset Distillation is a newly emerging area that aims to distill large datasets into much smaller and highly informative synthetic ones to accelerate training and reduce storage. Among various dataset distillation methods, trajectory-matching-based methods (MTT) have achieved SOTA performance in many tasks, e.g., on CIFAR-10/100. However, due to exorbitant memory consumption when unrolling optimization through SGD steps, MTT fails to scale to large-scale datasets such as ImageNet-1K. Can we scale this SOTA method to ImageNet-1K and does its effectiveness on CIFAR transfer to ImageNet-1K? To answer these questions, we first propose a procedure to exactly compute the unrolled gradient with constant memory complexity, which allows us to scale MTT to ImageNet-1K seamlessly with ~6x reduction in memory footprint. We further discover that it is challenging for MTT to handle datasets with a large number of classes, and propose a novel soft label assignment that drastically improves its convergence. The resulting algorithm sets new SOTA on ImageNet-1K: we can scale up to 50 IPCs (Image Per Class) on ImageNet-1K on a single GPU (all previous methods can only scale to 2 IPCs on ImageNet-1K), leading to the best accuracy (only 5.9% accuracy drop against full dataset training) while utilizing only 4.2% of the number of data points - an 18.2% absolute gain over prior SOTA. Our code is available at https://github.com/justincui03/tesla

Scaling Up Dataset Distillation to ImageNet-1K with Constant Memory

The paper presents a significant advance in dataset distillation, a field focused on compressing large datasets into much smaller, highly informative synthetic ones. The goal is to accelerate model training and reduce storage while preserving most of the performance obtained by training on the full dataset. Prior approaches based on trajectory matching (MTT, Matching Training Trajectories) achieve state-of-the-art (SOTA) results on datasets such as CIFAR-10/100, but they do not scale: unrolling optimization through many stochastic gradient descent (SGD) steps incurs prohibitive memory costs, which blocks their application to large-scale datasets such as ImageNet-1K.
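For context, MTT optimizes the synthetic set $S$ so that a student initialized at a teacher checkpoint and trained on $S$ for $N$ steps ends up close to a later teacher checkpoint. With notation lightly simplified from the original MTT formulation, the objective is

```latex
\mathcal{L}(S) \;=\;
\frac{\bigl\lVert \hat{\theta}_{t+N} - \theta^{*}_{t+M} \bigr\rVert_2^2}
     {\bigl\lVert \theta^{*}_{t} - \theta^{*}_{t+M} \bigr\rVert_2^2},
\qquad
\hat{\theta}_{i+1} \;=\; \hat{\theta}_{i} - \eta\, \nabla_{\theta}\, \ell\bigl(\hat{\theta}_{i};\, S\bigr),
\qquad
\hat{\theta}_{t} \;=\; \theta^{*}_{t},
```

where $\theta^{*}$ denotes teacher (expert) checkpoints and $\hat{\theta}$ the student parameters trained on $S$. Backpropagating $\mathcal{L}$ to $S$ requires differentiating through all $N$ inner SGD updates, which is why the memory of a naive implementation grows linearly with $N$.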

Key Contributions

  • Memory-efficient MTT: The authors introduce a procedure that exactly computes the unrolled gradient with constant memory complexity, allowing MTT to scale to ImageNet-1K while cutting the memory footprint by roughly 6x. The memory complexity drops from linear in the number of unrolled SGD (matching) steps to constant; see the sketch after this list.
  • Soft Label Assignment (SLA): The paper identifies that MTT struggles on datasets with many classes and proposes a soft label assignment that drastically improves convergence by letting soft labels share information across classes. SLA is training-free, introduces no additional hyperparameters, and yields significant empirical gains; it also appears in the sketch below.
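
To make these two ideas concrete, below is a compact, hypothetical PyTorch sketch of one outer iteration. Soft labels are taken as the teacher's predicted class probabilities on the synthetic images (one natural training-free reading of SLA), and the synthetic-data gradient is accumulated step by step against the final parameter residual so that only a single step's computation graph is ever in memory. All names (`distill_step`, `student_fn`, `soft_cross_entropy`) are illustrative, and the per-step accumulation shown is a simplified first-order version: the paper's exact constant-memory derivation also accounts for cross-step dependencies, which this sketch omits.

```python
# Hypothetical sketch (not the paper's exact algorithm): teacher-generated soft
# labels plus per-step accumulation of the synthetic-data gradient, so that at
# most one unrolled step's computation graph is alive at any time.
import torch
import torch.nn as nn
import torch.nn.functional as F


def soft_cross_entropy(logits, soft_targets):
    # Cross-entropy against soft probability targets.
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()


def distill_step(x_syn, teacher, student_fn, theta_start, theta_target,
                 n_steps, lr_inner):
    """One outer update of the synthetic images x_syn (requires_grad=True).

    student_fn(theta, x) -> logits is a functional forward pass with the
    student parameters carried explicitly as a flat tensor theta.
    """
    # Soft Label Assignment (training-free): label each synthetic image with
    # the teacher's predicted class probabilities.
    with torch.no_grad():
        y_soft = F.softmax(teacher(x_syn), dim=-1)

    # Pass 1: roll out the student trajectory, keeping only detached
    # parameter vectors (no activation graphs are retained across steps).
    thetas = [theta_start.detach()]
    theta = theta_start.detach()
    for _ in range(n_steps):
        theta_req = theta.clone().requires_grad_(True)
        loss = soft_cross_entropy(student_fn(theta_req, x_syn.detach()), y_soft)
        g = torch.autograd.grad(loss, theta_req)[0]
        theta = (theta - lr_inner * g).detach()
        thetas.append(theta)

    # Trajectory-matching residual, normalized as in MTT.
    denom = (theta_start - theta_target).pow(2).sum().detach()
    residual = 2.0 * (thetas[-1] - theta_target) / denom  # dL/d(theta_final)

    # Pass 2: replay the trajectory one step at a time and accumulate each
    # step's (first-order) contribution to dL/dx_syn; each step's graph is
    # freed before the next is built, so memory is constant in n_steps.
    grad_x = torch.zeros_like(x_syn)
    for t in range(n_steps):
        theta_t = thetas[t].clone().requires_grad_(True)
        loss = soft_cross_entropy(student_fn(theta_t, x_syn), y_soft)
        g_t = torch.autograd.grad(loss, theta_t, create_graph=True)[0]
        step_term = (-lr_inner * residual * g_t).sum()  # -lr * <residual, g_t>
        grad_x += torch.autograd.grad(step_term, x_syn)[0]

    match_loss = (thetas[-1] - theta_target).pow(2).sum() / denom
    return grad_x, match_loss.item()


# Minimal usage with a linear student on random data (illustrative only).
if __name__ == "__main__":
    num_classes, dim, ipc = 10, 32, 1
    x_syn = torch.randn(num_classes * ipc, dim, requires_grad=True)
    teacher = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, num_classes))

    def student_fn(theta, x):
        W = theta[: num_classes * dim].view(num_classes, dim)
        b = theta[num_classes * dim:]
        return x @ W.t() + b

    theta_start = 0.01 * torch.randn(num_classes * dim + num_classes)
    # Stand-in for a later teacher checkpoint on the same trajectory.
    theta_target = theta_start + 0.1 * torch.randn_like(theta_start)

    grad_x, loss_val = distill_step(x_syn, teacher, student_fn, theta_start,
                                    theta_target, n_steps=10, lr_inner=0.1)
    with torch.no_grad():
        x_syn -= 1e-2 * grad_x  # one outer SGD step on the synthetic images
    print(f"trajectory-matching loss: {loss_val:.4f}")
```

The key point is that memory use does not grow with `n_steps`: pass 1 keeps only detached per-step parameter vectors rather than activation graphs, and pass 2 rebuilds and frees one step's graph at a time.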

Results and Implications

The resulting algorithm, TESLA (Trajectory Matching with Soft Label Assignment), sets a new SOTA on ImageNet-1K, scaling up to 50 images per class (IPC) on a single GPU, whereas previous methods could only reach 2 IPC on ImageNet-1K. TESLA loses only 5.9% accuracy relative to training on the full dataset while using just 4.2% of the data points, an 18.2% absolute gain over the prior SOTA.

The implications are substantial both theoretically and practically. On the theoretical side, the constant-memory computation of the unrolled gradient pushes the limits of what is possible in dataset distillation and opens avenues for handling even larger datasets efficiently. Practically, the reduced memory requirement means that distilling massive datasets becomes feasible on a single GPU, which matters wherever data storage and processing capacity are the limiting factors.

Future Directions

The paper suggests several future directions worth exploring:

  1. Integration of Other Label Techniques: While SLA has shown effectiveness, exploring other methods for optimizing label assignment could further enhance MTT's performance, especially on datasets with high class counts.
  2. Removal of Teacher Dependency: TESLA currently relies on a pre-trained teacher model. Investigating ways to remove this dependency could simplify the pipeline and broaden applicability across machine learning tasks.
  3. Application in Diverse Domains: Extending the use of TESLA beyond the traditional image datasets to other data types, such as text or tabular data, might reveal additional benefits and challenges, prompting new lines of inquiry in dataset condensation methods.

In conclusion, this paper lays the groundwork for future research endeavors to further scale dataset distillation techniques, enhancing computational accessibility in the processing of large-scale datasets across diverse AI applications.

Authors (4)
  1. Justin Cui (9 papers)
  2. Ruochen Wang (29 papers)
  3. Si Si (24 papers)
  4. Cho-Jui Hsieh (211 papers)
Citations (102)
GitHub: https://github.com/justincui03/tesla