Dataset Distillation
- Dataset distillation is a method for synthesizing a small synthetic dataset that enables models to reach generalization performance comparable to training on the full data.
- It employs bi-level optimization and matching techniques—such as gradient, trajectory, and distribution matching—to align the learning dynamics of synthetic and original datasets.
- Key benefits include faster training, reduced storage and communication costs, and enhanced privacy, making it valuable for applications like federated learning and neural architecture search.
Dataset distillation is a methodology for synthesizing a compact synthetic dataset from a large labeled collection such that models trained on the synthetic data achieve generalization performance comparable to models trained on the full original dataset. The primary motivation is to enable fast training, reduce storage and transmission costs, facilitate privacy (when direct real-data sharing is restricted), and create task-specific data surrogates for scenarios such as federated learning or neural architecture search (Yu et al., 2023, Lei et al., 2023).
1. Formal Objectives and Mathematical Frameworks
Let the full real dataset be $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^{N}$, with $x_i$ in the data space and $y_i$ the corresponding labels. Dataset distillation aims to construct a much smaller synthetic set $\mathcal{S} = \{(s_j, \tilde{y}_j)\}_{j=1}^{M}$, $M \ll N$, such that training a model on $\mathcal{S}$ yields generalization on real data that closely approximates training on $\mathcal{T}$ (Wang et al., 2018, Yang et al., 6 Jun 2024).
The canonical bi-level optimization formulation is
$$\mathcal{S}^{*} \;=\; \arg\min_{\mathcal{S}} \; \mathcal{L}_{\mathcal{T}}\big(\theta^{*}(\mathcal{S})\big) \quad \text{s.t.} \quad \theta^{*}(\mathcal{S}) \;=\; \arg\min_{\theta} \; \mathcal{L}_{\mathcal{S}}(\theta),$$
where the “inner” loop solves model fitting on the synthetic set $\mathcal{S}$, and the “outer” loop ensures that the resulting model $\theta^{*}(\mathcal{S})$ performs well on the full dataset $\mathcal{T}$ (Yang et al., 6 Jun 2024, Yu et al., 2023).
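To make the bi-level structure concrete, the following is a minimal, hedged sketch of performance matching with backpropagation through an unrolled inner loop (in the spirit of Wang et al., 2018). The linear model, data shapes, learning rates, and number of unrolled steps are illustrative assumptions, not a published configuration.

```python
# Sketch of bi-level dataset distillation: the inner loop fits parameters on the
# synthetic set S; the outer loop differentiates through those updates (BPTT)
# to improve S on the real data T. All shapes/hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def inner_train(W, syn_x, syn_y, lr=0.1, steps=5):
    """Unrolled gradient descent on S; create_graph keeps the graph so the
    outer loss can backpropagate through these parameter updates."""
    for _ in range(steps):
        loss = F.cross_entropy(syn_x @ W, syn_y)
        (grad,) = torch.autograd.grad(loss, W, create_graph=True)
        W = W - lr * grad
    return W

# Learnable synthetic set: 10 flattened "images" with fixed labels 0..9.
syn_x = torch.randn(10, 784, requires_grad=True)
syn_y = torch.arange(10)
opt = torch.optim.Adam([syn_x], lr=0.01)

# Stand-in real batch (in practice, minibatches drawn from T).
real_x, real_y = torch.randn(256, 784), torch.randint(0, 10, (256,))

for outer_step in range(100):
    W0 = (0.01 * torch.randn(784, 10)).requires_grad_()   # fresh initialization
    W_T = inner_train(W0, syn_x, syn_y)                    # inner loop on S
    outer_loss = F.cross_entropy(real_x @ W_T, real_y)     # evaluate on T
    opt.zero_grad()
    outer_loss.backward()                                  # BPTT through inner_train
    opt.step()
```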
Alternate frameworks recast distillation as single-level matching problems, e.g., matching feature or gradient statistics computed over $\mathcal{T}$ and $\mathcal{S}$ (distribution matching, gradient matching), or by directly optimizing the representational distance between trajectories of models trained on full versus distilled data (trajectory matching) (Liu et al., 2023, Chen et al., 2023).
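As a concrete instance of the single-level matching view, here is a minimal gradient-matching sketch: the gradient of the loss on the synthetic set is aligned (via cosine distance) with the gradient on a real batch under a freshly sampled model. The linear model, cosine objective, and all shapes are illustrative assumptions rather than a specific published method.

```python
# Sketch of one gradient-matching step: align the grad of the loss on S with the
# grad on a real batch from T, then update the synthetic images.
import torch
import torch.nn.functional as F

def grad_match_loss(W, syn_x, syn_y, real_x, real_y):
    g_syn = torch.autograd.grad(F.cross_entropy(syn_x @ W, syn_y), W, create_graph=True)[0]
    g_real = torch.autograd.grad(F.cross_entropy(real_x @ W, real_y), W)[0]
    # 1 - cosine similarity between the flattened gradients.
    return 1 - F.cosine_similarity(g_syn.flatten(), g_real.detach().flatten(), dim=0)

syn_x = torch.randn(10, 784, requires_grad=True)           # learnable synthetic images
syn_y = torch.arange(10)
opt = torch.optim.SGD([syn_x], lr=0.1)

for step in range(100):
    W = (0.01 * torch.randn(784, 10)).requires_grad_()      # freshly sampled network
    real_x, real_y = torch.randn(64, 784), torch.randint(0, 10, (64,))  # stand-in real batch
    loss = grad_match_loss(W, syn_x, syn_y, real_x, real_y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```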
2. Methodological Taxonomy
Dataset distillation algorithms can be systematically organized as follows (Yu et al., 2023, Yang et al., 6 Jun 2024):
- Performance Matching (Meta-Learning/BPTT): Optimize synthetic data so that after network training, the parameter set performs optimally on the original data. Classical bi-level approaches employ backpropagation through time to update synthetic data by differentiating through unrolled training updates (Wang et al., 2018).
- Parameter Matching:
- Gradient Matching: Align the gradient of the loss on $\mathcal{S}$ with that on $\mathcal{T}$ at various points during training. This can be done for a single update or over multiple steps (Yu et al., 2023).
- Trajectory Matching: Match the entire or partial training dynamics (sequence of parameters) of a model trained on real vs. synthetic data (Yang et al., 6 Jun 2024, Feng et al., 2023).
- Distribution Matching: Align distributional statistics (e.g., mean features under random embeddings, higher-order moments) of $\mathcal{S}$ and $\mathcal{T}$ in an appropriate representational space. Methods include mean-feature matching (MMD), Wasserstein distance-based barycenter matching, or optimal quantization in latent space (Liu et al., 2023, Tan et al., 13 Jan 2025); a minimal mean-feature matching sketch is given after this list.
- Parameterization and Data Space: Synthetic data may live in pixel space or in a compressed latent space parameterized by autoencoders or generative models, with some recent methods explicitly optimizing over quantized latents for compression efficiency (Bao et al., 23 Jul 2025, Duan et al., 2023).
- Label Distillation: Synthetic labels can be fixed, learnable, or “soft” as in knowledge distillation regimes (Yu et al., 2023).
- Vision–Language & Multi-modal Integration: Some contemporary approaches incorporate textual prototypes and vision-language diffusion to capture higher-level semantic information, enhancing downstream generalization (Zou et al., 30 Jun 2025).
3. Empirical Properties and Information Content
Empirical evaluation consistently demonstrates that dataset distillation can drastically reduce the sample size required for effective downstream performance, especially on standard benchmarks (e.g., CIFAR-10/100, TinyImageNet, ImageNet-1K). For instance, with as few as 10–50 images per class, distilled datasets can enable ConvNet architectures to reach over 70% test accuracy versus 86% for full-data training on CIFAR-10 (Yu et al., 2023, Yang et al., 6 Jun 2024). Factorized and knowledge-distillation-infused methods enable ImageNet-1K distillation to 50 images/class with 30–60% top-1 accuracy, and further compress storage cost by quantizing latents and decoders (Bao et al., 23 Jul 2025).
Recent investigations show distilled data predominantly encode features corresponding to the early training dynamics of real-data models. This is supported by:
- High agreement between distilled-trained and early-stopped models on real data.
- Early saturation of accuracy when evaluating a real-trained model on a fixed distilled set.
- Hessian-trace analysis indicating distilled sets yield rapid “learning” and quickly become flat in the loss landscape (Yang et al., 6 Jun 2024).
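The Hessian-trace analysis mentioned above is typically computed with a stochastic trace estimator. A minimal Hutchinson-style sketch, assuming a differentiable scalar loss and a list of model parameters, might look as follows; the Rademacher probes and sample count are standard but illustrative choices.

```python
# Hutchinson estimator of the Hessian trace of a loss, a common proxy for
# loss-landscape flatness. `loss` must be a scalar built with the current
# parameters; `params` is a list of tensors with requires_grad=True.
import torch

def hessian_trace(loss, params, num_samples=10):
    grads = torch.autograd.grad(loss, params, create_graph=True)
    trace = 0.0
    for _ in range(num_samples):
        # Rademacher probe vectors (entries in {-1, +1}).
        vs = [torch.randint_like(g, 2) * 2 - 1 for g in grads]
        # Hessian-vector products via a second backward pass.
        Hv = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
        trace += sum((v * hv).sum() for v, hv in zip(vs, Hv)).item()
    return trace / num_samples
```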
Distilled data are not simple compressions or prototypical subsets. Rather, each point can exhibit highly targeted semantic influence on real test samples, sometimes covering fine-grained attributes beyond class labels (Yang et al., 6 Jun 2024). However, such data lack substitutability: training a model on a mixture of a few distilled samples and real samples often degrades performance compared to training on the distilled data alone, in contrast to the cumulative gains observed in real-data training (Yang et al., 6 Jun 2024).
4. Scalability, Compression, and Architecture Transfer
A major focus in recent research is increasing the scalability of dataset distillation. Advances include:
- Rate–Utility Optimization: Jointly optimizing the Shannon entropy (bit-cost) of the compressed synthetic dataset and model utility, enabling up to 170× compression over previous methods (e.g., 13 bits/class for CIFAR-10 at >77% accuracy) (Bao et al., 23 Jul 2025); a schematic objective is sketched after this list.
- Pushforward Optimal Quantization: Reformulating disentangled distillation as minimizing the Wasserstein-2 distance between the latent distribution of the real data and its optimal quantization, leading to sample-efficient and provably consistent synthetic sets under generative priors (Tan et al., 13 Jan 2025).
- Latent Space Parameterization: Encoding synthetic data in the latent space of pretrained generative models dramatically reduces storage and computation while supporting high-res images and large label spaces (Duan et al., 2023, Li et al., 10 May 2025).
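A schematic rate–utility objective, as referenced in the first item above, combines a task (utility) loss with a Lagrangian weight on the estimated bit-cost of the synthetic set's discrete latent codes. The factorized categorical entropy model and the weight value below are illustrative assumptions, not the exact construction of Bao et al. (23 Jul 2025).

```python
# Schematic rate-utility loss: utility_loss + lambda * estimated bits needed to
# store the synthetic set's discrete latent codes under a learned entropy model.
import math
import torch

def estimated_bits(code_logits):
    """code_logits: (num_codes, vocab_size) learnable logits over discrete
    latent codes; returns the entropy (ideal expected code length) in bits."""
    log_p = code_logits.log_softmax(dim=-1)
    return -(log_p.exp() * log_p).sum() / math.log(2.0)   # nats -> bits

def rate_utility_loss(utility_loss, code_logits, lam=1e-3):
    # lam trades compression (rate) against downstream accuracy (utility).
    return utility_loss + lam * estimated_bits(code_logits)
```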
Architecture overfitting remains a challenge: synthetic data often inherit the biases of the model or architecture used during distillation. Model-agnostic, distribution-matching approaches or techniques that employ a diverse “model pool” during optimization improve cross-architecture generalization, with explicit regularization via teacher–student knowledge distillation further enhancing robustness (Zhou et al., 20 Feb 2024).
5. Bias Propagation and Mitigation
Dataset distillation can amplify certain spurious biases present in the source data. In bias injection studies, severe amplification arises for “easy” color or background shortcuts (e.g., CMNIST, BG-FMNIST), where distilled sets encode these biases more strongly than the original full data, leading to 50–70% drops in out-of-bias test accuracy compared to modest 4–17% drops from real-data training. By contrast, corruption biases (e.g., noise, blur) are typically suppressed in distillation, as their “hardness” resists condensation and high-frequency components cancel out (Cui et al., 6 Jun 2024).
A practical mitigation is KDE-based sample reweighting: real-data samples with high density in feature space (i.e., likely to be bias-dominated) are downweighted in the matching loss, thereby forcing distilled data to capture intrinsic task features. This approach leads to dramatically improved performance under severe bias settings, surpassing the efficacy of alternative debiasing methods (Cui et al., 6 Jun 2024).
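A minimal sketch of this reweighting, assuming a Gaussian kernel density estimate in a fixed feature space and an inverse-density weighting rule (the specific kernel, bandwidth, and normalization are illustrative choices):

```python
# KDE-based reweighting: down-weight real samples that lie in high-density
# (likely bias-dominated) regions of feature space before computing the
# matching loss. Kernel, bandwidth, and normalization are illustrative.
import torch

def kde_weights(features, bandwidth=1.0):
    d2 = torch.cdist(features, features).pow(2)                   # pairwise squared distances
    density = torch.exp(-d2 / (2 * bandwidth ** 2)).mean(dim=1)   # Gaussian-kernel density estimate
    w = 1.0 / (density + 1e-8)                                    # inverse-density weights
    return w / w.sum()                                            # normalize to sum to 1

# Usage: weight the per-sample matching loss computed on the real data, e.g.
# weighted_loss = (kde_weights(real_features) * per_sample_matching_loss).sum()
```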
6. Advanced Paradigms and Application-Specific Formulations
A growing body of work recognizes that distilled sets' effectiveness depends crucially on the task-specific nature of the downstream inference. A principled, task-oriented formalism for dataset distillation requires specifying the statistical inference operation (e.g., conditional estimation, generalization to OOD), the training operator, and even the choice of coherent risk functional—otherwise the optimization is under-specified and subject to vacuous solutions (Kungurtsev et al., 2 Sep 2024). Gradient-, distribution-, or trajectory-matching methods can be interpreted as one-level approximations to the true task-driven KKT stationarity system.
Recent extensions include:
- Vision–Language Distillation: Combining learned visual prototypes and LLM-generated textual prototypes within guided diffusion unlocks both semantic precision and improved classification accuracy under challenging settings (Zou et al., 30 Jun 2025).
- Progressive Distillation: Sequentially synthesizing subsets for different training regimes (early/late dynamics), then cumulatively training on their union, closes much of the remaining performance gap to full-data training and supports large synthetic sets (Chen et al., 2023).
- Domain-specific and Multi-modal Distillation: Application-driven DD enables structured knowledge integration for medical data with heterogeneous feature sets and out-of-distribution robustness for physics-informed neural networks via risk-optimized synthetic boundary data (Kungurtsev et al., 2 Sep 2024).
7. Current Limitations and Future Directions
Key challenges in dataset distillation include:
- Scaling to large datasets and high resolutions: While latent-based and diffusion-driven methods partially address this, optimizing directly in pixel space remains prohibitive for ImageNet-1K scale without bespoke architectures or factorization strategies (2301.15547, Tan et al., 13 Jan 2025).
- Cross-architecture generalization: Methods leveraging a model pool or knowledge distillation during and after the synthetic set creation show progress, but generalization to radically different backbones (e.g., transformer vs. ConvNet) is not yet competitive with real data (Zhou et al., 20 Feb 2024, Chen et al., 2023).
- Bias, robustness, and privacy: Handling source data biases, supporting differentially private distillation, and resisting adversarial attacks or backdoor leakage present open fronts (Cui et al., 6 Jun 2024, Yu et al., 2023).
- Task-specific design: The emerging consensus is to formulate the distillation objective in application-driven terms, specifying both the inference functional and risk criteria, potentially by integrating core information extraction and purposeful learning (Kungurtsev et al., 2 Sep 2024).
Future research will likely explore continuous rate–utility tradeoffs, architecture-free distillation, multi-modality, theoretical sample complexity bounds, and principled bias/robustness interventions, with the overarching aim of making dataset distillation a first-class tool across domains and data modalities (Bao et al., 23 Jul 2025, Liu et al., 2023, Kungurtsev et al., 2 Sep 2024).