Dataset Distillation Methods
- Dataset distillation is the process of compressing large datasets into a compact set of synthetic samples that retain model generalization.
- Methods leverage bi-level meta-learning, gradient/trajectory matching, and generative approaches such as diffusion and quantization to optimize training dynamics.
- These techniques address storage, computational, and privacy challenges, offering practical benefits in federated learning, neural architecture search (NAS), and robust cross-architecture transfer.
Dataset distillation aims to synthesize a small, highly informative set of synthetic samples from large datasets such that training on this surrogate yields models closely matching the generalization of those trained on the full dataset. This reduction addresses challenges of storage, compute, privacy, and efficiency in large-scale machine learning. The last half-decade has seen a proliferation of methods, spanning optimization paradigms, data domains, and robustness targets, progressing from bi-level meta-learning to diffusion-based synthesis and distributional quantization. This article surveys core methodological advances, representative algorithms, theoretical foundations, and practical implications, referencing cutting-edge approaches documented in arXiv literature.
1. Formalization and Core Objectives
Dataset distillation formalizes the compression of a large training set $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^{N}$ into a compact synthetic set $\mathcal{S} = \{(\tilde{x}_j, \tilde{y}_j)\}_{j=1}^{M}$, with $M \ll N$ and the $\tilde{y}_j$ typically soft or hard labels, so that a model trained on $\mathcal{S}$ achieves test risk similar to training on $\mathcal{T}$. The canonical objective is

$$\min_{\mathcal{S}} \; \mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ \ell\big(f_{\mathrm{alg}(\mathcal{S})}(x),\, y\big) \right],$$

where $\mathrm{alg}$ denotes the learning procedure (SGD, KRR, etc.), $\ell$ is the task loss (e.g., cross-entropy), and $\mathcal{D}$ is the data distribution (Lei et al., 2023).
Modern methods refine this by matching not only performance but aspects of model training dynamics: gradients at each step, full parameter trajectories, or high-order statistics in feature space. Extensions now target high-resolution images, sequence/text, and cross-modal pairs (Sucholutsky et al., 2019, Wu et al., 2023).
2. Methodological Taxonomies and Strategies
2.1. Meta-Learning (Bi-Level) Paradigms
Early and conceptually fundamental techniques (Wang et al., 2018, Lei et al., 2023) explicitly optimize the inner learning outcome via bi-level objectives:

$$\min_{\mathcal{S}} \; \mathcal{L}_{\mathcal{T}}\big(\theta^{*}(\mathcal{S})\big) \quad \text{s.t.} \quad \theta^{*}(\mathcal{S}) = \arg\min_{\theta} \mathcal{L}_{\mathcal{S}}(\theta),$$

with the outer loss $\mathcal{L}_{\mathcal{T}}$ evaluated on the full dataset and the inner loss $\mathcal{L}_{\mathcal{S}}$ on the synthetic set. These frameworks typically require differentiating through the unrolled learning process (Backpropagation Through Time, BPTT).
Challenges include:
- High memory and compute (due to gradient unrolling)
- Instability of meta-gradients for long unrolling horizons.
Recent advances like Random Truncated BPTT (RaT-BPTT) mitigate this by randomizing truncation windows, reducing both variance and memory (Feng et al., 2023).
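The bi-level recipe can be made concrete with a toy sketch: a linear model serves as the inner learner, the inner loop is unrolled explicitly, and finite-difference meta-gradients with crude backtracking stand in for BPTT. All names, dimensions, and hyperparameters below are illustrative; this is a didactic sketch, not any published algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" dataset: a noisy linear target
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=200)

def inner_train(Xs, ys, steps=20, lr=0.1):
    """Unrolled inner loop: gradient descent on the synthetic set."""
    w = np.zeros(Xs.shape[1])
    for _ in range(steps):
        w -= lr * 2 * Xs.T @ (Xs @ w - ys) / len(ys)
    return w

def outer_loss(Xs, ys):
    """Outer objective: risk of the inner-trained model on the real data."""
    w = inner_train(Xs, ys)
    return np.mean((X @ w - y) ** 2)

# Learnable synthetic set: 10 samples (inputs and labels both optimized)
Xs, ys = rng.normal(size=(10, 5)), rng.normal(size=10)
init_loss, eps, meta_lr = outer_loss(Xs, ys), 1e-4, 0.1

for _ in range(200):
    base = outer_loss(Xs, ys)
    gX, gy = np.zeros_like(Xs), np.zeros_like(ys)
    for idx in np.ndindex(Xs.shape):          # finite-difference meta-gradient
        Xp = Xs.copy(); Xp[idx] += eps
        gX[idx] = (outer_loss(Xp, ys) - base) / eps
    for j in range(len(ys)):
        yp = ys.copy(); yp[j] += eps
        gy[j] = (outer_loss(Xs, yp) - base) / eps
    Xn, yn = Xs - meta_lr * gX, ys - meta_lr * gy
    if outer_loss(Xn, yn) < base:             # crude backtracking keeps descent monotone
        Xs, ys = Xn, yn
    else:
        meta_lr *= 0.5

final_loss = outer_loss(Xs, ys)
print(final_loss < init_loss)  # the distilled set now trains a better model than a random one
```

The memory/compute pain points listed above show up even here: each meta-gradient entry requires a full inner unroll, which is exactly what truncation schemes such as RaT-BPTT aim to cut down.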
2.2. Matching Dynamics: Gradient, Trajectory, and Distribution
- Gradient Matching (DC, DSA, IDC): Match per-step, class-wise, or factorized gradients at successive points in training between real and synthetic sets (Lei et al., 2023, Chen et al., 2023).
- Trajectory Matching (MTT, FTD, TESLA): Align entire parameter updates or "training curves" of models trained on real and distilled data (Chen et al., 2023, Lei et al., 2023).
- Distribution Matching (DM, WMDD): Minimize feature-space discrepancies—the maximum mean discrepancy (MMD) or Wasserstein distance (WMDD)—between real and synthetic data mapped by a pre-trained encoder (Liu et al., 2023).
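The distribution-matching criterion can be sketched in a few lines of NumPy: a (biased) squared-MMD estimator with an RBF kernel, where random Gaussian arrays stand in for encoder features (the pre-trained encoder is omitted; the bandwidth choice is illustrative, roughly a median-heuristic scale).

```python
import numpy as np

def mmd2_rbf(X, Y, gamma):
    """Biased estimator of squared MMD between samples X and Y under an RBF kernel."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))              # stand-in for encoder features of real data
syn_good = rng.normal(size=(50, 8))           # synthetic set from the same distribution
syn_bad = rng.normal(loc=2.0, size=(50, 8))   # distribution-mismatched synthetic set
gamma = 1.0 / real.shape[1]                   # simple bandwidth heuristic

print(mmd2_rbf(real, syn_good, gamma) < mmd2_rbf(real, syn_bad, gamma))  # True: DM loss prefers the matched set
```

DM-style methods minimize exactly this kind of discrepancy with respect to the synthetic samples (via autodiff in practice), which avoids any nested inner training loop.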
Progressive strategies such as Progressive Dataset Distillation (PDD) and Sequential Subset Matching (SeqMatch) further decompose the synthetic set into temporally or sequentially optimized blocks, explicitly matching the dynamics at different phases of training to overcome the limits of a single, static distilled subset (Chen et al., 2023, Du et al., 2023).
2.3. Generative and Factorized Approaches
With the rise of powerful generative models, several methods now perform distillation in the latent space of VAEs, GANs, or diffusion models:
- Textual Inversion & Diffusion-Prompt Distillation (D3M, D³HR): Reduce entire classes to a single learnable prompt token; images are generated conditional on this prompt via large pre-trained diffusion models, dramatically reducing storage requirements and improving diversity (Abbasi et al., 2024, Zhao et al., 23 May 2025).
- Optimal Quantization/Barycenter Matching: Interpreting dataset distillation as an instance of optimal transport, these approaches quantize the latent (or feature) distribution according to the Wasserstein metric, synthesizing images from quantized centroids (Tan et al., 13 Jan 2025, Liu et al., 2023).
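The quantization view can be sketched with Lloyd's algorithm, which computes a locally optimal W2 quantization of an empirical measure. Here Gaussian vectors stand in for latent features and the decode-to-image step is omitted; `lloyd_quantize` is an illustrative helper, not any paper's implementation.

```python
import numpy as np

def lloyd_quantize(Z, k, iters=20, seed=0):
    """Lloyd's algorithm: k centroids minimizing mean squared quantization error,
    i.e. a (locally) optimal W2 quantization of the empirical measure on Z."""
    rng = np.random.default_rng(seed)
    C = Z[rng.choice(len(Z), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((Z[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for j in range(k):
            members = Z[assign == j]
            if len(members):
                C[j] = members.mean(0)   # centroid = barycenter of its cell
    d2 = ((Z[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    return C, d2.min(1).mean()

rng = np.random.default_rng(1)
Z = rng.normal(size=(1000, 16))          # stand-in for latent features of the real dataset
C10, err10 = lloyd_quantize(Z, k=10)     # IPC=10 analogue: 10 centroids per "class"
C50, err50 = lloyd_quantize(Z, k=50)
print(err50 < err10)                     # more centroids -> finer quantization
```

In the actual methods, the centroids live in a generative model's latent space and are decoded into synthetic images, so the per-class budget (IPC) directly controls the quantization level.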
2.4. Importance Weighting and Pruning
Recent algorithms integrate parameter- or data-sample-level importance into the loss, often via adaptive weights or core-set selection:
- Importance-Aware Adaptive Dataset Distillation (IADD): Learns per-parameter weights, focusing the distillation objective on dimensions that are hard, but not impossible, to match, boosting accuracy and cross-architecture generalization (Li et al., 2024).
- Parameter and Instance Pruning: Iteratively prune network weights or exclude loss-dominated ("easy" or "hard") training samples to focus on the most informative subspaces, improving both within- and cross-architecture robustness (Li et al., 2022, Moser et al., 2024).
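The "prune first, distill after" idea reduces to a per-sample loss filter; `prune_by_loss` and the fractions below are illustrative names and defaults, not the published method's exact recipe.

```python
import numpy as np

def prune_by_loss(losses, easy_frac=0.2, hard_frac=0.2):
    """Keep mid-difficulty samples: drop the lowest-loss ('easy') and
    highest-loss ('hard') fractions of the training set before distillation."""
    order = np.argsort(losses)           # ranks samples from easiest to hardest
    n = len(losses)
    lo, hi = int(easy_frac * n), n - int(hard_frac * n)
    return np.sort(order[lo:hi])         # indices of retained samples

rng = np.random.default_rng(0)
losses = rng.exponential(size=100)       # stand-in for per-sample losses from a reference model
keep = prune_by_loss(losses)
print(len(keep))                         # 60 of 100 samples survive pruning
```

The retained subset then feeds whatever distillation backbone is in use; the reported gains suggest that loss-dominated outliers mostly add noise to the matching objectives.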
2.5. Label and Region Augmentation
To improve learning signals and cross-architecture transfer, dense label augmentation and local-region refinement have been proposed:
- Label-Augmented Dataset Distillation (LADD): Generate soft, dense labels for all spatial sub-images within a synthetic sample, amplifying the semantic supervisory signal with little storage overhead and yielding consistent +10–30% gains (Kang et al., 2024).
- Non-Critical Region Refinement (NRR-DD): Use class activation maps to preserve critical instance pixels while refining non-critical regions to enforce class-general statistics. Combined with Distance-Based Representative (DBR) transfer, this nearly eliminates the need for full soft-label storage at scale (Tran et al., 24 Mar 2025).
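The region-refinement idea can be sketched as CAM-masked blending; `refine_non_critical`, the 0.5 threshold, and the 50/50 blend weight are illustrative stand-ins, not the published method's exact operations.

```python
import numpy as np

def refine_non_critical(img, cam, class_mean, thresh=0.5):
    """Keep pixels in critical (high-CAM) regions untouched; blend
    non-critical pixels toward per-class mean statistics."""
    mask = (cam >= thresh)[..., None]        # critical-region mask, broadcast over channels
    return np.where(mask, img, 0.5 * img + 0.5 * class_mean)

rng = np.random.default_rng(0)
img = rng.uniform(size=(32, 32, 3))          # one synthetic image
cam = rng.uniform(size=(32, 32))             # normalized class activation map
class_mean = np.full((32, 32, 3), 0.5)       # illustrative per-class pixel statistics
out = refine_non_critical(img, cam, class_mean)
print(out.shape)                             # same shape as the input image
```

Instance-discriminative pixels survive intact while background regions drift toward class-general statistics, which is what lets DBR-style transfer stand in for bulky per-sample soft labels.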
3. Algorithmic and Practical Considerations
| Method | Optimization Paradigm | Scaling Complexity* | Benchmark Accuracy (CIFAR-10 @ IPC=10) |
|---|---|---|---|
| MTT | Trajectory matching | — | ~65.3% |
| KIP | NTK/GP kernel ridge regression | — | ~62.7% |
| DM | Distribution matching (MMD/W2) | — | 48–67% |
| PDD (w/ MTT, IDC) | Progressive trajectory/gradient matching | base method + | +1–5% vs. single set |
| D3M, D³HR, DDOQ | Diffusion/quantization | encoding, generation | Best SOTA at moderate/high IPC |

*Scaling estimates abstract over image size/depth and the number of classes. See (Lei et al., 2023, Chen et al., 2023, Abbasi et al., 2024, Tan et al., 13 Jan 2025).
Practical Elements
- Synthetic Labeling: Soft-label distillation and dense sub-image label augmentation routinely improve accuracy and are now standard in high-performance pipelines (Sucholutsky et al., 2019, Kang et al., 2024).
- Compression and Storage: Diffusion-model and prompt-based approaches yield up to 50× storage reduction over raw synthetic images at fixed IPC (Abbasi et al., 2024).
- Memory and Runtime: Single-level, adversarial-matching (DD-APM) and distribution-matching methods are preferred for very high resolution or resource-constrained settings due to their avoidance of nested optimization (Chen et al., 2023).
- Cross-architecture and cross-modal robustness: Model-pool distillation and label/dense-region augmentation, along with distribution/barycenter-based matching, have proven essential for strong transfer to architectures or modalities unseen during distillation (Zhou et al., 2024, Kang et al., 2024, Liu et al., 2023).
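The soft-label step mentioned above is typically just a temperature-scaled teacher softmax stored alongside each synthetic sample; a minimal sketch (names and the temperature value are illustrative):

```python
import numpy as np

def soft_labels(logits, T=4.0):
    """Temperature-scaled teacher softmax: the soft labels stored per synthetic sample."""
    z = logits / T
    z -= z.max(axis=1, keepdims=True)          # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 10))      # teacher outputs: 4 synthetic images, 10 classes
y_soft = soft_labels(teacher_logits)
print(y_soft.sum(axis=1))                      # each row sums to 1
```

Higher temperatures flatten the distribution, exposing inter-class similarity structure that hard labels discard; dense variants such as LADD apply the same idea per spatial sub-image.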
4. Evaluation Benchmarks and Quantitative Results
Representative large-scale and cross-architecture results:
| Dataset/Model | Method | IPC | Test Acc (%) (ResNet-18 unless noted) | Notable Comparative |
|---|---|---|---|---|
| CIFAR-10 | PDD+IDC | 10 | 67.9 ± 0.2 | IDC: 67.5 |
| CIFAR-10 | PDD+MTT | 10 | 66.9 ± 0.4 | MTT: 65.3 |
| CIFAR-100 | PDD+IDC | 10 | 45.8 ± 0.5 | IDC: 45.1 |
| Tiny-ImageNet | D3M | 50 | 51.43 | RDED: 41.45 |
| ImageNet-1K | DDOQ | 10 | 33.1 | D⁴M: 27.9, SRe²L: 21.3 |
| CIFAR-10 (X-arch) | IADD→Res18 | 10 | 54.9 | MTT: 46.4 |
| ImageNet-1K | NRR-DD | 10 | 46.1 | RDED: 42.0 |
Key ablations:
- PDD shows that multi-phase, progressive union training is critical; training on the union or sequentially on isolated subsets can underperform by up to 8% (Chen et al., 2023).
- Loss-based pruning (“prune first, distill after”) achieves up to +5.2 percentage point improvements, even after removing up to 80% of the training set before distillation (Moser et al., 2024).
- Adversarial prediction matching (DD-APM) allows distilling ImageNet-1K at only 10% of the original size, yet achieves ~94% of full-data accuracy and surpasses SOTA in cross-architecture transfer (Chen et al., 2023).
5. Theoretical Underpinnings and Open Challenges
- Distillation as bi-level optimization (meta-learning) links directly to hyperparameter optimization, with additional statistical complexities arising from random network initialization and varying condition numbers of neural loss landscapes (Wang et al., 2018, Lei et al., 2023).
- Gradient/trajectory matching is motivated by empirical findings of simplicity bias and phase transitions in DNN training, and the sequential/PDD approach draws support from curriculum learning theory and forgetting dynamics (Chen et al., 2023).
- Distributional methods connect to optimal transport and quantization, with formal rates for convergence in Wasserstein space and feature-space projection (Tan et al., 13 Jan 2025, Liu et al., 2023).
- Limitations remain in scalability (billions of pixels, 1000-class datasets), cross-modality generalization, and theoretical risk guarantees. Most approaches are still sensitive to the architecture or initialization used during distillation, with architecture-agnostic or model-pool methods only partially closing the gap (Zhou et al., 2024).
6. Applications, Extensions, and Future Prospects
Application Domains
- Continual and Federated Learning: Distilled sets serve as compact replay buffers or communication surrogates, improving learning with strict bandwidth, privacy, or device constraints (Lei et al., 2023).
- Medical Data and Privacy: Synthetic X-ray distillation delivers nearly full-data diagnostic accuracy with orders-of-magnitude data reduction, enabling private or federated workflows (Li et al., 2024).
- NAS/AutoML Proxies: Distilled sets accelerate architecture search by providing efficient proxies for full-dataset training (Lei et al., 2023, Chen et al., 2023).
- Vision-Language: Trajectory and contrastive matching yield successful image-text co-distillation for retrieval (Wu et al., 2023).
Prospects and Open Problems
- The scaling of diffusion/prompt and distribution-matching quantization for high-resolution or multi-modal/structured-label domains.
- Integration of advanced importance-weighting, dynamic sample selection, and regional refinement to further boost efficiency and transferability.
- Deeper theoretical understanding: developing PAC- or information-theoretic bounds linking synthetic set size, architecture class, and risk.
- Model-invariant or privacy-preserving distillation, including differentially private mechanisms and membership-inference guarantees (Lei et al., 2023).
Dataset distillation remains an actively evolving area, now integrating advances in generative models, optimal transport, and cross-domain meta-learning. Recent methods demonstrate substantial progress in compression, generalization, and compute/disk efficiency, but fundamental scalability, robustness, and theoretical limits are still under study.