Dataset Distillation Procedure

Updated 14 May 2026

Dataset distillation creates a small synthetic dataset such that models trained on it approximate the performance of models trained on the full dataset.
Applications include faster training, lower compute and storage requirements, and reduced exposure of the original data when privacy matters.
Techniques such as gradient matching and distribution matching are used to distill images, text, graphs, and tabular data.

Dataset distillation procedures aim to replace a large-scale dataset $\mathcal{T}$ with a substantially smaller synthetic set $\mathcal{S}$ that enables models trained on $\mathcal{S}$ to approximate the performance of models trained on the original data. This approach is motivated by the need for faster research iteration, lower resource consumption, improved eco-sustainability, and, increasingly, privacy and security requirements. Contemporary dataset distillation encompasses a spectrum of techniques, from bi-level optimization to distribution matching, and includes adaptations for images, graphs, text, tabular data, and settings with specific constraints such as differential privacy or federated operation.

1. Formal Problem Statement and Optimization Framework

Dataset distillation is commonly formulated as a bi-level optimization problem or a surrogate thereof. The objective is to find a compact synthetic set $\widetilde{D}$ (the set $\mathcal{S}$ above), typically of $k \ll N = |D|$ points, such that training a (possibly randomized) learning algorithm on $\widetilde{D}$ yields parameters $\theta_{\widetilde{D}}$ whose performance on the original dataset (or on unseen validation samples) matches that of $\theta_D$ trained on the full dataset $D$ (the set $\mathcal{T}$ above):

$$\min_{\widetilde{D}} \ L_D(\theta_{\widetilde{D}}) \qquad \text{subject to} \qquad \theta_{\widetilde{D}} = \arg\min_\theta L_{\widetilde{D}}(\theta),$$

where $L_X(\theta)$ denotes the empirical risk of parameters $\theta$ over a dataset $X$. Because full bi-level optimization is computationally intractable, surrogate objectives are employed:

Meta-model matching (full or truncated backpropagation through time) (Wang et al., 2018, Feng et al., 2023)
Gradient matching (alignment of first-order gradients per class) (Feng et al., 2023)
Distribution matching (statistical mean/feature alignment) (Sajedi et al., 2024)
Trajectory matching (full parameter trajectory alignment) (Yao et al., 14 Apr 2025)
Factorization, prototype, or generative model parameterization (Tan et al., 13 Jan 2025, Sajedi et al., 2024)

Unlike classical model (knowledge) distillation, which compresses a model, these objectives compress the dataset itself.
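
The bi-level formulation in this section can be made concrete with a deliberately tiny sketch. It assumes a one-parameter model $y = \theta x$, a single synthetic point $(x_s, y_s)$ with $x_s$ held fixed, and a closed-form inner solution, so the outer gradient follows from the chain rule. All names (`empirical_risk`, `inner_solve`, `distill_single_point`) and the toy data are illustrative assumptions, not taken from any cited method.

```python
def empirical_risk(theta, data):
    """L_X(theta): mean squared error of the model y = theta * x on dataset X."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)

def inner_solve(x_s, y_s):
    """Inner problem in closed form: argmin_theta (theta * x_s - y_s)^2."""
    return y_s / x_s

def distill_single_point(real_data, x_s=1.0, y_s=0.0, lr=0.05, steps=200):
    """Outer problem: gradient descent on y_s to minimise L_D(theta(y_s))."""
    for _ in range(steps):
        theta = inner_solve(x_s, y_s)
        # dL_D/dtheta evaluated at the inner solution
        dL_dtheta = sum(2 * x * (theta * x - y) for x, y in real_data) / len(real_data)
        # chain rule: dtheta/dy_s = 1 / x_s
        y_s -= lr * dL_dtheta / x_s
    return x_s, y_s

# Hypothetical toy data, roughly y = 2x.
real_data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
x_s, y_s = distill_single_point(real_data)
theta_syn = inner_solve(x_s, y_s)
theta_full = sum(x * y for x, y in real_data) / sum(x * x for x, y in real_data)
print(theta_syn, theta_full)
```

In this toy setting one synthetic point is enough for the model trained on it to coincide with the least-squares fit on all real points. For deep networks the inner problem has no closed form, which is what motivates the surrogate objectives listed above.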

2. Core Methodological Approaches

The main procedures in dataset distillation fall into several categories, each with distinct algorithmic realizations and mathematical objectives.

2.1 Meta-model and Trajectory Matching

The original dataset distillation algorithm (Wang et al., 2018) and subsequent meta-learning-inspired variants solve the above bi-level problem by unrolling the inner optimization of $\theta_{\widetilde{D}}$ (typically several SGD steps on $\widetilde{D}$) and differentiating a held-out loss on the real data $D$ through this unrolled computation. The practical challenge is the cost (time, memory, instability) of reverse-mode differentiation through deep unrolls. Recent advances employ Random Truncated Backpropagation Through Time (RaT-BPTT), which randomly truncates the unroll window and backpropagates only over a fixed number of final steps of a meta-iteration, reducing gradient variance and memory footprint and stabilizing optimization (Feng et al., 2023).
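
A minimal sketch of unrolled differentiation with a truncated window is given below, under strong simplifying assumptions: a scalar model $y = \theta x$, one synthetic point whose label `y_s` is learned, and forward-mode sensitivities in place of reverse-mode backpropagation (for this scalar case both give the same meta-gradient over the same window). Sampling the unroll length uniformly between `window` and `max_unroll` is one reading of the random truncation described above; the function names and hyperparameters are hypothetical.

```python
import random

def real_loss(theta, real_data):
    """Outer objective L_D(theta): mean squared error of y = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in real_data) / len(real_data)

def real_loss_grad(theta, real_data):
    """Derivative of L_D with respect to theta."""
    return sum(2 * x * (theta * x - y) for x, y in real_data) / len(real_data)

def meta_gradient(y_s, real_data, unroll, window, x_s=1.0, theta0=0.0, alpha=0.1):
    """Return (dL_D/dy_s, theta_T), differentiating only the last `window` inner steps."""
    theta, sens = theta0, 0.0  # sens tracks d(theta)/d(y_s) inside the window
    for t in range(unroll):
        if t >= unroll - window:
            # derivative of the SGD update below with respect to y_s
            sens -= alpha * 2 * x_s * (sens * x_s - 1.0)
        # inner SGD step on the synthetic loss (theta * x_s - y_s)^2
        theta -= alpha * 2 * x_s * (theta * x_s - y_s)
    return real_loss_grad(theta, real_data) * sens, theta

def rat_bptt_step(y_s, real_data, rng, max_unroll=20, window=5, lr=0.05):
    """One meta-update with a randomly sampled unroll length and a fixed window."""
    unroll = rng.randint(window, max_unroll)
    grad, _ = meta_gradient(y_s, real_data, unroll, window)
    return y_s - lr * grad

real_data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]  # hypothetical toy data
rng = random.Random(0)
y_s = 0.0
for _ in range(300):
    y_s = rat_bptt_step(y_s, real_data, rng)
print(y_s)
```

Setting `window` equal to `unroll` recovers the full unrolled gradient. A smaller window ignores the dependence of earlier inner steps on the synthetic data, which is where the savings in memory and compute come from.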

Trajectory-matching approaches attempt to match the sequence of model parameters obtained during training on $\widetilde{D}$ to those generated when training on $D$, often using layerwise or aggregated distance metrics (Yao et al., 14 Apr 2025, Chen et al., 2023).
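
As an illustration of such a distance, the sketch below uses a normalized squared parameter distance: how far the student (trained on synthetic data) ends from the expert's later checkpoint, relative to how far the expert itself moved. This is one commonly used form and is an assumption here; the cited works may use different metrics. Parameters are plain Python lists, and the layer names are hypothetical.

```python
def sq_norm_diff(a, b):
    """Squared Euclidean distance between two flat parameter vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def trajectory_matching_loss(student_end, expert_start, expert_end):
    """Aggregated loss: student-to-expert distance, normalised by the expert's own movement."""
    return sq_norm_diff(student_end, expert_end) / sq_norm_diff(expert_start, expert_end)

def layerwise_trajectory_loss(student_end, expert_start, expert_end):
    """Layerwise variant: the same ratio per layer (dict of name -> flat list), averaged."""
    ratios = [
        trajectory_matching_loss(student_end[name], expert_start[name], expert_end[name])
        for name in expert_end
    ]
    return sum(ratios) / len(ratios)

# Hypothetical checkpoints for a two-layer model.
expert_start = {"fc1": [0.0, 0.0], "fc2": [1.0]}
expert_end = {"fc1": [1.0, 1.0], "fc2": [3.0]}
student_end = {"fc1": [1.0, 0.0], "fc2": [3.0]}
print(layerwise_trajectory_loss(student_end, expert_start, expert_end))
```

A value of 0 means the student reached the expert checkpoint exactly, while a value of 1 means it is no closer to the target than the expert's starting point was.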

2.2 Gradient, Feature, and Distribution Matching

Gradient matching methods align the gradients of model parameters computed on batches of synthetic and real data, class-wise or globally. Differentiable Siamese Augmentation (DSA) extends gradient matching by applying the same differentiable augmentation to the real and synthetic batches. Distribution Matching (DM) belongs to the same family of matching objectives but aligns feature averages (embeddings) of real and synthetic batches rather than gradients (Sajedi et al., 2024, Cui et al., 2024).
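
A class-wise gradient matching objective can be sketched as follows. The example assumes a linear model with squared loss and uses cosine distance between the real and synthetic gradients of each class; the model, the distance, and all names are illustrative choices rather than the exact setup of any cited method.

```python
import math

def batch_gradient(w, batch):
    """Gradient of the mean squared error of a linear model w.x over a batch of (x, y)."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += 2 * err * xj / len(batch)
    return grad

def cosine_distance(g1, g2, eps=1e-12):
    """1 - cosine similarity; eps guards against zero-norm gradients."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    return 1.0 - dot / (n1 * n2 + eps)

def gradient_matching_loss(w, real_by_class, syn_by_class):
    """Sum over classes of the distance between real and synthetic gradients."""
    return sum(
        cosine_distance(batch_gradient(w, real_by_class[c]), batch_gradient(w, syn_by_class[c]))
        for c in real_by_class
    )

w = [0.0, 0.0]
real_by_class = {0: [([1.0, 0.0], 1.0)], 1: [([0.0, 1.0], -1.0)]}
syn_opposed = {0: [([-1.0, 0.0], 1.0)], 1: [([0.0, -1.0], -1.0)]}
print(gradient_matching_loss(w, real_by_class, real_by_class))  # identical batches
print(gradient_matching_loss(w, real_by_class, syn_opposed))    # opposed gradients
```

In a full method this loss would be differentiated with respect to the synthetic inputs and minimized over many model initializations, which requires an automatic differentiation framework and is omitted here.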

Distribution matching via feature statistics (means, variances at various network layers) bypasses the need for expensive inner-loop optimization, permitting scalable and stable distillation—especially amenable to high-resolution or large datasets (Sajedi et al., 2024, Tan et al., 13 Jan 2025).
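
The mean-embedding form of distribution matching can be illustrated with a short sketch. It assumes a randomly initialized single-layer ReLU feature extractor and compares only feature means; the extractor, its dimensions, and the function names are hypothetical stand-ins for the networks used in practice.

```python
import random

def make_random_features(in_dim, out_dim, seed=0):
    """Random linear map followed by ReLU, standing in for a randomly initialised network."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def phi(x):
        return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]

    return phi

def mean_embedding(phi, batch):
    """Average feature vector of a batch."""
    feats = [phi(x) for x in batch]
    return [sum(col) / len(feats) for col in zip(*feats)]

def dm_loss(phi, real_batch, syn_batch):
    """Squared distance between the mean embeddings of real and synthetic batches."""
    m_real = mean_embedding(phi, real_batch)
    m_syn = mean_embedding(phi, syn_batch)
    return sum((a - b) ** 2 for a, b in zip(m_real, m_syn))

phi = make_random_features(in_dim=2, out_dim=8)
real_batch = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
syn_batch = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]]
print(dm_loss(phi, real_batch, real_batch), dm_loss(phi, real_batch, syn_batch))
```

No model is trained inside this loss, which is why this family avoids the inner loop. Practical variants typically resample the feature extractor and match statistics per class.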

2.3 Latent/Parameter Space and Generative-Model Distillation

Instead of directly optimizing in pixel or data space, one can distill knowledge into the parameters of a generative model. Data-to-Model Distillation (D2M) specifically aligns the representations (attention maps, logits, etc.) of real and generated images under random feature extractors, with loss functions comprising embedding and prediction matching (Sajedi et al., 2024). The distilled generative model can synthesize any number of informative samples (arbitrary images-per-class, IPC) upon request, without retraining.

Optimal Quantization/Pushed-Forward Distillation approaches recast distillation as a finite-support Wasserstein quantization problem in the latent space. A pre-trained encoder maps inputs to a latent distribution, which is quantized into a small set of centroids, and these are decoded (optionally via a diffusion prior) to produce the synthetic set. Dataset Distillation by Optimal Quantization (DDOQ) further optimizes quantizer weights and reweights synthetic samples during downstream learning for improved performance (Tan et al., 13 Jan 2025).
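
The quantization step can be sketched with Lloyd's algorithm (k-means), a standard way to approximate an optimal quantizer. The example assumes the latents have already been produced by an encoder, omits decoding, and initializes centroids from the first `k` latents; these choices and the function names are illustrative and are not the DDOQ implementation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two latent vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(latents, k, iters=50):
    """Lloyd's algorithm: returns k centroids and the probability mass assigned to each."""
    centroids = [list(z) for z in latents[:k]]  # assumption: initialise from the first k latents
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for z in latents:
            nearest = min(range(k), key=lambda i: sq_dist(z, centroids[i]))
            clusters[nearest].append(z)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    weights = [len(members) / len(latents) for members in clusters]
    return centroids, weights

# Hypothetical 2-D latents forming two groups.
latents = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0], [1.0, 0.0], [11.0, 10.0]]
centroids, weights = quantize(latents, k=2)
print(centroids, weights)
```

The returned weights play the role of the per-sample weights mentioned above: a downstream learner would weight the loss on each decoded centroid by the mass of the latent region it represents.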

2.4 Enhancements: Progressive, Boosted, and Pruning-based Distillation

Progressive Dataset Distillation (PDD): Instead of a monolithic synthetic set, PDD splits training into several phases, each with its own distilled subset. The union is built incrementally, mirroring the changing training dynamics of deep networks, and each subset is optimized conditioned on its predecessors (Chen et al., 2023).
Boosted Construction: Addressing subset intercorrelation issues, boosting-based methods construct synthetic sets in blocks, optimizing each consecutively while “freezing” or only partially updating previous blocks, yielding nested subsets for any target budget (Feng et al., 2023).
Loss-based Pruning: Filtering out high-loss (hard/noisy/outlier) real samples before distillation leads to distilled sets with improved cross-architecture generalization and test accuracy; only the lowest-loss (easiest) subset is kept per class (“Prune First, Distill After”) (Moser et al., 2024).
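
The loss-based pruning step described in the last item can be sketched as a per-class filter. The example assumes per-sample losses have already been computed by some reference model; the tuple layout, the `keep_fraction` parameter, and the sample values are hypothetical.

```python
from collections import defaultdict

def prune_by_loss(samples, keep_fraction=0.5):
    """Keep the lowest-loss fraction of each class.

    samples: list of (sample_id, label, loss) tuples.
    """
    by_class = defaultdict(list)
    for sample in samples:
        by_class[sample[1]].append(sample)
    kept = []
    for items in by_class.values():
        items.sort(key=lambda s: s[2])  # easiest (lowest-loss) samples first
        n_keep = max(1, int(len(items) * keep_fraction))
        kept.extend(items[:n_keep])
    return kept

samples = [
    ("a", 0, 0.1), ("b", 0, 0.9), ("c", 0, 0.3), ("d", 0, 2.5),
    ("e", 1, 0.2), ("f", 1, 1.7),
]
kept_ids = sorted(s[0] for s in prune_by_loss(samples))
print(kept_ids)
```

The pruned subset then replaces the full dataset as the input to whichever distillation objective is used.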

3. Extensions for Privacy, Robustness, and Special Regimes

3.1 Differentially Private Distillation

Standard dataset distillation procedures may leak sensitive information from the original data due to lack of formal privacy guarantees. DP-GenG mitigates this risk by combining DP-fine-tuned generators, DP feature matching, and expert-guided refinement under a strict privacy budget, composing $\mu$-GDP guarantees across all steps and converting them to approximate $(\varepsilon, \delta)$-DP (Shi et al., 13 Nov 2025). The pipeline uses DP-generated data for initialization, feature matching with added noise, and DP-SGD on expert classifiers, with a budget allocation strategy to maximize utility under privacy constraints.
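
The accounting pattern mentioned here, composing Gaussian DP guarantees and converting the total to approximate DP, can be sketched with the standard Gaussian DP results: mechanisms that are $\mu_i$-GDP compose to $\sqrt{\sum_i \mu_i^2}$-GDP, and a $\mu$-GDP mechanism is $(\varepsilon, \delta(\varepsilon))$-DP with $\delta(\varepsilon) = \Phi(-\varepsilon/\mu + \mu/2) - e^{\varepsilon}\,\Phi(-\varepsilon/\mu - \mu/2)$. The per-stage budgets below are hypothetical, and the exact accounting used by DP-GenG may differ.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def compose_gdp(mus):
    """Composition of mu_i-GDP mechanisms is sqrt(sum mu_i^2)-GDP."""
    return math.sqrt(sum(m * m for m in mus))

def gdp_to_delta(mu, eps):
    """delta(eps) such that a mu-GDP mechanism is (eps, delta)-DP."""
    return normal_cdf(-eps / mu + mu / 2.0) - math.exp(eps) * normal_cdf(-eps / mu - mu / 2.0)

def gdp_to_eps(mu, delta, lo=0.0, hi=100.0, iters=100):
    """Smallest eps in [lo, hi] with delta(eps) <= delta, found by bisection (delta is decreasing in eps)."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if gdp_to_delta(mu, mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical budgets for generator fine-tuning, feature matching, and expert training.
stage_mus = [0.6, 0.5, 0.4]
mu_total = compose_gdp(stage_mus)
eps_total = gdp_to_eps(mu_total, delta=1e-5)
print(mu_total, eps_total)
```

A budget allocation strategy then amounts to choosing the per-stage values so that the composed total stays within the target while giving more budget to the stages where noise hurts utility most.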

3.2 Federated Distillation and Noise-Robust Procedures

Secure Federated Data Distillation (SFDD): The distillation process is decentralized across clients, each privately holding a dataset. Gradient-matching losses are computed locally and only the updated synthetic samples are aggregated server-side. Label-differential privacy (LDPO-RLD) obfuscates synthetic label gradients with noise to prevent gradient-inversion attacks, with no raw data ever leaving clients (Arazzi et al., 19 Feb 2025).

Trust-Aware Diversion (TAD): When source labels are noisy, a dual-loop structure partitions data into “trusted” and “untrusted” regions: distillation focuses on the trusted subset, while an inner loop recalibrates and promotes reliable untrusted examples. The partition is based on a Gaussian mixture model fit to per-sample cross-entropy losses, with class-wise dynamic thresholds. This yields greater robustness under high noise rates (Wu et al., 7 Feb 2025).
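
The trusted/untrusted partition can be sketched by fitting a two-component one-dimensional Gaussian mixture to per-sample losses with expectation-maximization and thresholding the posterior of the low-loss component. This is a simplified illustration: it uses a single fixed threshold instead of class-wise dynamic thresholds, and the function names, initialization, and loss values are assumptions.

```python
import math

def gauss_pdf(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_two_gaussians(losses, iters=100, min_var=1e-6):
    """EM for a two-component 1-D Gaussian mixture; returns (means, variances, mixing weights)."""
    means, variances, mix = [min(losses), max(losses)], [1.0, 1.0], [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each loss value
        resp = []
        for x in losses:
            p = [mix[k] * gauss_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate parameters from the responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            means[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, losses)) / nk
            variances[k] = max(min_var, var)
            mix[k] = nk / len(losses)
    return means, variances, mix

def trusted_mask(losses, threshold=0.5):
    """True where the posterior of the low-loss (assumed clean) component reaches the threshold."""
    means, variances, mix = fit_two_gaussians(losses)
    clean = 0 if means[0] <= means[1] else 1
    mask = []
    for x in losses:
        p = [mix[k] * gauss_pdf(x, means[k], variances[k]) for k in range(2)]
        total = sum(p) or 1e-300
        mask.append(p[clean] / total >= threshold)
    return mask

losses = [0.10, 0.12, 0.08, 0.11, 0.09, 2.0, 2.2, 1.9, 2.1]  # hypothetical cross-entropy values
print(trusted_mask(losses))
```

Distillation would then run on the samples marked trusted, while the remaining samples are candidates for relabeling or later promotion.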

3.3 Debiasing and Reweighting

Dataset distillation can amplify bias present in the original data. Sample reweighting using kernel density estimation (KDE) in feature space, which down-weights high-density (bias-aligned) samples and focuses on underrepresented instances, substantially reduces bias amplification and yields large accuracy gains in bias-conflicting test regimes (e.g., 23.8% $\to$ 91.5% on CMNIST for DM) (Cui et al., 2024).
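
One simple way to realize this reweighting is sketched below: estimate each sample's density in feature space with a Gaussian kernel and weight samples by an inverse power of that density. The inverse-density rule, the `temperature` exponent, and the bandwidth are assumptions made for illustration; the weighting function in the cited work may differ.

```python
import math

def kde_density(features, bandwidth=1.0):
    """Gaussian-kernel density estimate at each sample (up to a constant factor)."""
    densities = []
    for a in features:
        total = sum(
            math.exp(-sum((x - y) ** 2 for x, y in zip(a, b)) / (2.0 * bandwidth ** 2))
            for b in features
        )
        densities.append(total / len(features))
    return densities

def debias_weights(features, bandwidth=1.0, temperature=1.0):
    """Weights proportional to density^(-temperature), normalised to sum to one."""
    densities = kde_density(features, bandwidth)
    raw = [d ** (-temperature) for d in densities]
    norm = sum(raw)
    return [r / norm for r in raw]

# Three bias-aligned samples clustered together and one bias-conflicting outlier.
features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
weights = debias_weights(features)
print(weights)
```

These weights would multiply each real sample's contribution to the matching loss, so that rare, bias-conflicting samples are not averaged away.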

4. Algorithmic Workflows and Pseudocode Structure

The procedural workflow differs with objective but commonly includes:

Initialization:
- Random, real data, or generative model-based initialization of synthetic samples.
- DP-GenG uses DP-generated examples from a fine-tuned generator (Shi et al., 13 Nov 2025).
Main Loop:
- Outer loop: optimize the synthetic set by differentiating a meta-loss (real-data performance, feature discrepancy, or trajectory deviation) through one or more unrolled steps of model training (Wang et al., 2018, Feng et al., 2023).
- Inner loop: standard training of model parameters on synthetic data for a fixed number of steps or until convergence.
Loss Computation:
- Compute appropriate loss: meta-model/trajectory matching, gradient matching, feature/statistics matching, or distribution matching.
- For DP-setting, add Gaussian noise to features/gradients and perform clip+aggregate for privacy guarantees (Shi et al., 13 Nov 2025).
Update:
- Synthetic data (or their latent representation/parameters) are updated via SGD or Adam, with learning rates and step schedules tailored to the modality and loss.
Output:
- The compact synthetic set, generator parameters, or centroids/weights in quantization-based methods.
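
The stages above can be assembled into a compact end-to-end sketch. It assumes one-dimensional data, a fixed feature map $\phi(x) = (x, x^2)$, a moment-matching loss with a hand-derived gradient, and initialization from the first `m` real samples; every name and hyperparameter is illustrative rather than drawn from a cited implementation.

```python
def moments(points):
    """Mean of the features phi(x) = (x, x^2) over a set of points."""
    n = len(points)
    return sum(points) / n, sum(p * p for p in points) / n

def moment_loss(syn, real):
    """Loss computation: squared gap between synthetic and real feature means."""
    (s1, s2), (r1, r2) = moments(syn), moments(real)
    return (s1 - r1) ** 2 + (s2 - r2) ** 2

def distill(real, m=2, steps=500, lr=0.05):
    syn = list(real[:m])                  # Initialization: copy m real samples
    target = moments(real)
    for _ in range(steps):                # Main loop
        m1, m2 = moments(syn)
        d1, d2 = m1 - target[0], m2 - target[1]
        # gradient of d1^2 + d2^2 with respect to each synthetic point s
        grads = [(2 * d1 + 4 * d2 * s) / m for s in syn]
        syn = [s - lr * g for s, g in zip(syn, grads)]  # Update: plain gradient descent
    return syn                            # Output: the compact synthetic set

real = [-3.0, -1.0, 1.0, 3.0]
syn = distill(real)
print(sorted(syn), moment_loss(syn, real))
```

Here two synthetic points are enough to reproduce the first two moments of the four real points. Replacing `moment_loss` and its gradient with a meta-loss, gradient-matching loss, or trajectory loss changes the method but not the overall loop structure.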

A high-level canonical pseudocode structure is presented in multiple studies (Feng et al., 2023, Rosu et al., 2024, Tan et al., 13 Jan 2025, Shi et al., 13 Nov 2025), with specific hyperparameters and optimization settings determined empirically.

5. Modalities, Generalization, and Scalability

5.1 Data Modalities

Images: Standard for most algorithmic development, with pixel-based, latent, and generative model approaches all applied (Sachdeva et al., 2023).
Text: Adaptation via trajectory matching and prompt embedding learning for LLM instruction tuning; cross-architecture transfer enabled by prompt-to-token mapping (Yao et al., 14 Apr 2025).
Tabular Data: Bi-level optimization with explicit learning-rate scheduling, multi-architecture average to promote robust generalization (Medvedev et al., 2020, Rosu et al., 2024).
Graphs: Gradient/distribution matching on node features and adjacency representations (Sachdeva et al., 2023).

5.2 Practical Considerations and Scaling

Methods such as D2M and quantization-based distillation (DDOQ) scale linearly with the size of the distilled set, not the original data, and are resolution-agnostic (Sajedi et al., 2024, Tan et al., 13 Jan 2025).
D2M enables production of arbitrary-scale synthetic datasets post-distillation without retraining the generative model.
Progressive and boosting-based approaches enable construction of nested or expandable synthetic sets fit for variable downstream training budgets (Feng et al., 2023, Chen et al., 2023).
Strong cross-architecture generalization is observed for methods that explicitly avoid overfitting the synthetic set to one model's bias—through feature/statistics matching across model pools, pruning-based preprocessing, or explicit cross-architecture optimization (Sajedi et al., 2024, Moser et al., 2024, Medvedev et al., 2020, Xia et al., 5 Feb 2026).

6. Empirical Results, Trade-offs, and Key Insights

Empirical evaluations on standard benchmarks (CIFAR-10/100, TinyImageNet, ImageNet-1K, tabular datasets) consistently indicate:

High compression ratios: a small number of images per class (IPC) can recover a large share of full-data accuracy (Sachdeva et al., 2023).
Matching, and sometimes exceeding, real-data training when synthetic samples are carefully optimized or pruned (Feng et al., 2023, Moser et al., 2024).
Noise- and label-robust variants yield significantly higher accuracy under high-noise, low-signal or imbalanced data regimes (Wu et al., 7 Feb 2025, Cui et al., 2024, Rosu et al., 2024).
Differentially private procedures show only moderate utility drops under tight privacy budgets when DP-generated initialization and robust feature matching are combined (Shi et al., 13 Nov 2025).
Cross-architecture generalization is maximized when distillation is performed on core-sets (pruning), across architecture pools, or under representation/statistical constraints (Moser et al., 2024, Sajedi et al., 2024, Xia et al., 5 Feb 2026).
Generative-model-based approaches (e.g., D2M) eliminate the need to re-distill for each IPC or architecture, further improving efficiency (Sajedi et al., 2024).

Empirical insights from DP-GenG show that initialization with private generative data (DP-Init) alone yields an accuracy gain measured in percentage points; expert guidance further corrects DP noise-induced feature drift (Shi et al., 13 Nov 2025).

7. Limitations, Open Problems, and Future Directions

Bi-level DD remains expensive for deep or long-horizon problems due to unrolled differentiation and meta-gradient computation.
DD for modalities with discrete or structured data (e.g., graphs, text) is an active area; approaches like Farzi for sequences address the memory and optimization challenges by latent factorization and efficient reverse-mode differentiation (Sachdeva et al., 2023).
While factorized and generative-model-based distillation yield scalability and compression, their performance is contingent on the inductive bias and coverage of the generative prior.
Privacy, robustness to label noise and bias, and cross-domain generalization remain central. Techniques such as KDE-based reweighting, trust-aware diversion, and federated protocols will be increasingly prominent as data-sharing constraints grow.
Theoretical grounding (e.g., as in pushforward optimal quantization (Tan et al., 13 Jan 2025)) and formal privacy guarantees suggest future convergence of measurement-based and constructivist methodologies.

In summary, dataset distillation procedures encompass a toolkit of methodologically diverse, theoretically grounded, and practically scalable techniques for compressing large data collections into highly informative synthetic subsets, with broad applicability across learning modalities, architectures, and constraints (Sachdeva et al., 2023, Wang et al., 2018, Feng et al., 2023, Shi et al., 13 Nov 2025, Tan et al., 13 Jan 2025, Sajedi et al., 2024).