Dataset Distillation via Trajectory Matching
- The paper introduces a novel dataset distillation approach that minimizes trajectory discrepancies between networks trained on real and synthetic data.
- It integrates innovations like difficulty-aligned matching, partial updates, and contrastive enhancements to ensure robust performance across varied modalities.
- The method demonstrates high-fidelity distillation, achieving near full-data accuracy in image, text, vision-language, and privacy-sensitive domains.
Dataset distillation by matching training trajectories is a paradigm that synthesizes small, highly informative synthetic datasets by explicitly aligning a model’s learning dynamics on synthetic data with those observed during training on the original data. This approach, which generalizes beyond one-step or gradient-matching schemes, has proven to be a foundational methodology for high-fidelity dataset reduction across supervised and self-supervised contexts, images, vision-language, text, and even OOD or privacy-sensitive domains.
1. Trajectory Matching: Core Formulation and Motivation
The key principle of trajectory-matching dataset distillation is to minimize the discrepancy between parameter trajectories of neural networks trained on real versus synthetic data. Let $\mathcal{D}$ be the full dataset and $\mathcal{D}_{\text{syn}}$ the sought-after distilled set. Given an “expert” trajectory $\{\theta^*_t\}$ from training on $\mathcal{D}$, the synthetic data are optimized such that, after initializing the “student” network at some $\hat{\theta}_t = \theta^*_t$ and training for $N$ steps on $\mathcal{D}_{\text{syn}}$, the parameters $\hat{\theta}_{t+N}$ closely match the expert’s later state $\theta^*_{t+M}$.
The canonical loss is the normalized squared distance
$$\mathcal{L} = \frac{\lVert \hat{\theta}_{t+N} - \theta^*_{t+M} \rVert_2^2}{\lVert \theta^*_t - \theta^*_{t+M} \rVert_2^2}.$$
This loss can be minimized over image pixels, label vectors, and sometimes per-image learning rates. By backpropagating through the entire sequence of synthetic-data updates, the optimization explicitly shapes synthetic samples to induce the same long-range parameter shifts as the original dataset (Cazenavette et al., 2022).
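To make the mechanics concrete, below is a minimal PyTorch-style sketch of one outer-loop evaluation of this objective. It assumes a ReparamModule-style network that accepts flattened parameters (as in the public MTT code); names such as `student_net`, `expert_start`, and `expert_target` are illustrative, and details like batching and optimizer state are omitted.

```python
import torch
import torch.nn.functional as F

def trajectory_matching_loss(student_net, expert_start, expert_target,
                             syn_images, syn_labels, syn_lr, num_steps):
    """One outer-loop evaluation: unroll `num_steps` differentiable SGD updates
    on the synthetic data, then compare against the expert's later checkpoint.

    expert_start  -- flattened expert parameters theta*_t (student init)
    expert_target -- flattened expert parameters theta*_{t+M}
    syn_lr        -- inner-loop learning rate (may itself be learnable)
    """
    params = expert_start.clone().requires_grad_(True)
    for _ in range(num_steps):
        # Forward pass with explicit flat parameters (ReparamModule-style call).
        logits = student_net(syn_images, flat_param=params)
        inner_loss = F.cross_entropy(logits, syn_labels)
        # create_graph=True so the outer loss can backpropagate into the synthetic data.
        grads = torch.autograd.grad(inner_loss, params, create_graph=True)[0]
        params = params - syn_lr * grads  # differentiable SGD step

    # Normalized squared distance: error is measured relative to how far
    # the expert itself moved over the matched segment.
    numerator = (params - expert_target).pow(2).sum()
    denominator = (expert_start - expert_target).pow(2).sum()
    return numerator / denominator
```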
Unlike single-step or gradient-matching methods, trajectory matching ensures fidelity over many update steps, addressing compounding errors and the drift from real-data behavior that such short-range approaches induce.
2. Major Innovations and Algorithmic Variants
Trajectory matching underpins a range of dataset distillation frameworks, with diverse algorithmic innovations to address its inherent trade-offs.
Difficulty-Aligned Matching
Difficulty-Aligned Trajectory Matching (DATM) exploits the empirical observation that DNNs learn “easy” patterns early and “hard” patterns late during training. By aligning the synthetic set’s effective “difficulty” to its size—i.e., matching early trajectory segments for small sets and later ones for large sets—DATM achieves “lossless” distillation at large data budgets and avoids overfitting in the low-IPC regime (Guo et al., 2023). This windowed trajectory-matching approach is implemented via a progressive schedule on the matched portion of the expert path.
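As an illustration of a windowed, progressive schedule, the helper below picks which expert segment to match based on the synthetic budget and the current distillation iteration. The thresholds and the helper itself are hypothetical placeholders, not DATM's published settings.

```python
import random

def sample_matching_window(ipc, num_expert_epochs, cur_iter, max_iter):
    """Choose the starting epoch t of the expert segment [t, t+M] to match.

    Small synthetic budgets are aligned with early (easy) segments, larger
    budgets with later (hard) ones; the lower bound also slides forward as
    distillation progresses (a progressive schedule).
    """
    # Hypothetical difficulty caps keyed by images-per-class (IPC).
    if ipc <= 10:
        min_t, max_t = 0, int(0.3 * num_expert_epochs)
    elif ipc <= 50:
        min_t, max_t = int(0.1 * num_expert_epochs), int(0.6 * num_expert_epochs)
    else:
        min_t, max_t = int(0.3 * num_expert_epochs), num_expert_epochs - 1

    # Progressively raise the lower bound toward max_t over the run.
    progress = cur_iter / max_iter
    floor_t = min_t + int(0.5 * progress * (max_t - min_t))
    return random.randint(floor_t, max_t)
```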
Partial Update and Hybrid Selection
SelMatch combines trajectory matching with a selection-based initialization: a fraction of the synthetic samples is initialized from real data of IPC-tuned difficulty, and only the remaining fraction is iteratively distilled (updated) by trajectory matching. This preserves rare or hard features present in difficult real samples and remedies the coverage gap suffered by trajectory-only methods at large synthetic budgets (Lee et al., 28 May 2024). The synthetic set’s difficulty is tightly controlled via a window sliding over real data sorted by difficulty scores (C-score, Forgetting, etc.), and the distilled (trajectory-matched) fraction is tuned downwards as IPC grows.
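A rough sketch of selection-based initialization with partial updates is given below; the window placement, the assumption that lower difficulty scores mean easier samples, and the omission of per-class balancing are simplifications rather than SelMatch's exact procedure.

```python
import torch

def init_selection_based_synthetic_set(real_images, real_labels, difficulty_scores,
                                       ipc, num_classes, distill_fraction):
    """Initialize the synthetic set from a difficulty-controlled window of real
    data, then split it into a trajectory-matched (distilled) part and a fixed
    (selected) part. Per-class balancing is omitted for brevity.
    """
    order = torch.argsort(difficulty_scores)             # easy -> hard
    # Hypothetical window placement: start deeper into the hard region
    # as the synthetic budget grows.
    start = int(0.5 * len(order)) if ipc >= 50 else 0
    window = order[start:start + ipc * num_classes]

    syn_images = real_images[window].clone()
    syn_labels = real_labels[window].clone()

    num_distilled = int(distill_fraction * len(syn_images))
    distilled = syn_images[:num_distilled].clone().requires_grad_(True)  # updated by trajectory matching
    selected = syn_images[num_distilled:].clone()                        # kept as selected real samples
    return distilled, selected, syn_labels
```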
Filtering and Alignment
Prioritize Alignment in Dataset Distillation (PAD) further addresses misalignment by pruning real training data according to EL2N difficulty scores to suit the compression ratio, and restricting trajectory matching to deep layers, filtering out low-level or easily-mimicked information. This two-stage approach effectively matches both the "what" and "how" of information transfer, yielding further performance improvements (Li et al., 6 Aug 2024).
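For reference, EL2N scoring and ratio-based pruning can be sketched as follows; the averaging over a few lightly-trained models follows the standard EL2N recipe, while the choice of which end of the difficulty spectrum to keep is left as a hyperparameter here rather than PAD's specific rule.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def el2n_scores(models, images, labels, num_classes):
    """EL2N difficulty: L2 norm of the error vector (softmax output minus the
    one-hot label), averaged over several lightly-trained models."""
    one_hot = F.one_hot(labels, num_classes).float()
    scores = torch.zeros(len(images))
    for model in models:
        probs = F.softmax(model(images), dim=1)
        scores += (probs - one_hot).norm(dim=1).cpu()
    return scores / len(models)

def prune_for_budget(images, labels, scores, keep_ratio, keep_easy=True):
    """Keep only part of the real data (by EL2N difficulty) before generating
    expert trajectories; which end of the spectrum to keep depends on the
    compression ratio and is treated here as a hyperparameter."""
    order = torch.argsort(scores)                         # easy -> hard
    k = int(keep_ratio * len(order))
    keep = order[:k] if keep_easy else order[-k:]
    return images[keep], labels[keep]
```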
Contrastive Enhanced Matching
In settings with extremely low synthetic budgets, instance-level contrastive objectives (e.g., SimCLR-style InfoNCE loss) are incorporated into the inner-loop of trajectory matching to enhance feature diversity and semantic discrimination of synthetic samples (Li et al., 21 May 2025). This hybridization is crucial when single images per class are required to capture intra-class spread, especially for edge or resource-constrained deployments.
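A simplified (one-directional) InfoNCE term over two augmented views of the synthetic batch might be combined with the inner-loop classification loss as sketched below; the weighting `lam` and the feature extractor are assumptions, not the cited paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(features_a, features_b, temperature=0.5):
    """Simplified SimCLR-style InfoNCE: two augmented views of the same batch,
    matching rows are positives, every other sample acts as a negative."""
    a = F.normalize(features_a, dim=1)
    b = F.normalize(features_b, dim=1)
    logits = a @ b.t() / temperature                      # [B, B] similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # diagonal entries are positives
    return F.cross_entropy(logits, targets)

# Inside the inner loop, the synthetic-data classification loss would be
# augmented with the contrastive term (`lam` is a hypothetical weight):
#   feats_1, feats_2 = encoder(augment(syn_images)), encoder(augment(syn_images))
#   inner_loss = F.cross_entropy(logits_syn, syn_labels) + lam * info_nce(feats_1, feats_2)
```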
Storage-Efficient and Stable Trajectories
Conventional trajectory matching requires storing dozens of expert checkpoints. Matching Convexified Trajectory (MCT) replaces the noisy SGD path with a convex combination (straight-line segment) between the initial and final expert weights. This reduces storage from the full sequence of checkpoints to just the two endpoint states, enables continuous interpolation and sampling along the trajectory, and stabilizes the distillation process (Zhong et al., 28 Jun 2024).
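The idea reduces to linear interpolation between two stored states, as in the sketch below (the interpolation coefficients in the commented example are purely illustrative):

```python
def convex_checkpoint(theta_init, theta_final, t):
    """Sample an 'expert' state on the straight line between the stored initial
    and final weights; t in [0, 1] plays the role of training time."""
    return {name: (1.0 - t) * theta_init[name] + t * theta_final[name]
            for name in theta_init}

# Example: draw a start state and a target slightly further along the
# convexified trajectory (the offset of 0.2 is illustrative, not MCT's setting).
# expert_start  = convex_checkpoint(theta_init, theta_final, t=0.3)
# expert_target = convex_checkpoint(theta_init, theta_final, t=0.5)
```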
Automatic/Adaptive Trajectory Matching
Automatic Training Trajectories (ATT) eliminates phase mismatch by dynamically selecting, at each iteration, the step along the student’s trajectory that best matches the expert’s target, rather than fixing the matching horizon a priori. This mitigates “accumulated mismatching error” and enhances robustness and generalization, especially for transfer across architectures (Liu et al., 19 Jul 2024).
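Schematically, ATT-style selection keeps the intermediate student states from the inner loop and matches whichever is closest to the expert target; the normalized-distance criterion below is a plausible simplification of the published rule, not its exact form.

```python
import torch

def best_matching_step(student_states, expert_start, expert_target):
    """Given the student's intermediate (flattened) parameter states from the
    inner loop, return the normalized matching loss at the state closest to the
    expert target, instead of always using a fixed step count."""
    denom = (expert_start - expert_target).pow(2).sum()
    losses = torch.stack([(s - expert_target).pow(2).sum() / denom
                          for s in student_states])
    best = int(torch.argmin(losses))
    return losses[best], best
```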
3. Cross-Domain and Modality Extensions
Trajectory matching has been adopted and extended in a wide range of settings beyond canonical image classification:
- Text and LLMs: Pseudo prompt embeddings are learned for LLMs through trajectory matching, then quantized for cross-architecture transfer via nearest-neighbor mapping in token embedding space (see the sketch after this list). This approach enables instruction-tuning with only 5% of the original data, outperforming selection baselines and supporting OPT → Llama transfer (Yao et al., 14 Apr 2025).
- Vision-Language: Multi-modal trajectory-matching distillation is achieved by joint optimization of synthetic image–text pairs under a bi-directional contrastive (InfoNCE) loss and trajectory alignment, with LoRA parameterization applied for memory efficiency in transformers (Wu et al., 2023).
- Medical Imaging: Several variants—High-Order Progressive Trajectory Matching (HoP-TM) (Dong et al., 29 Sep 2025), progressive matching with dynamic overlap mitigation (Yu et al., 20 Mar 2024)—address instability and diversity collapse issues in highly imbalanced or privacy-constrained datasets, using stagewise matching and distributional regularization.
- Wi-Fi and Time-Series: In WiDistill, trajectory matching is used for distillation of Wi-Fi CSI time series, compressing domain-specific, high-dimensional sensor data for activity recognition while preserving cross-architecture generalization (Wang et al., 5 Oct 2024).
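The nearest-neighbor quantization step mentioned in the text/LLM bullet above can be sketched as follows; whether the original method uses cosine or Euclidean nearest neighbors, and how repeated tokens are handled, are details of the cited work, so this is an assumption-laden illustration only.

```python
import torch
import torch.nn.functional as F

def quantize_to_tokens(pseudo_embeddings, token_embedding_matrix):
    """Map each learned pseudo prompt embedding to the nearest vocabulary token
    (cosine similarity here), so the distilled prompt can be reused by a model
    with a different architecture via its own token embedding table."""
    pe = F.normalize(pseudo_embeddings, dim=1)         # [L, d] learned embeddings
    te = F.normalize(token_embedding_matrix, dim=1)    # [V, d] target vocabulary embeddings
    token_ids = (pe @ te.t()).argmax(dim=1)            # [L] nearest-neighbor token ids
    return token_ids
```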
4. Quantitative Benchmarks and Empirical Findings
Trajectory-matching distillation methods, including DATM, PAD, SelMatch, contrastive-enhanced variants, and MCT, consistently outperform random subset selection and one-step matching on vision tasks.
Representative results (test accuracy with a ConvNet evaluator, at key images-per-class (IPC) budgets) from the literature:
| Method | CIFAR-10 IPC=10 | CIFAR-10 IPC=50 | CIFAR-100 IPC=10 | TinyImageNet IPC=10 |
|---|---|---|---|---|
| Random | 31.0 | 50.6 | 14.6 | 1.4 |
| MTT | 65.4 | 71.6 | 39.7 | 8.8 |
| DATM | 66.8 | 76.1 | 47.2 | 13.6 |
| SelMatch | 85.9 | 90.4 | 54.5 | 44.7 |
| PAD | 67.4 | 77.0 | 47.8 | — |
| Contrastive TM | 70.2 | 77.6 | 48.7 | 14.8 |
| MCT | 66.0 | 72.3 | 42.5 | 22.6 |
Notably, in higher IPC regimes, SelMatch and DATM achieve “lossless” performance (matching full-data accuracy) at substantially reduced data sizes (Guo et al., 2023; Lee et al., 28 May 2024), and SelMatch improves coverage of hard test cases. On text and vision-language benchmarks, trajectory-matching approaches markedly outperform selection/coreset baselines (e.g., for LLM instruction-tuning, trajectory-matching distilled data match or exceed LESS-selected subsets (Yao et al., 14 Apr 2025); in vision-language, recall@1 improves by up to 661% over random selection for small synthetic budgets (Wu et al., 2023)).
5. Practical Considerations, Challenges, and Open Questions
Key factors influencing trajectory-matching performance and adoption:
- Expert trajectory computation is resource-intensive, especially for large models or high-resolution data; MCT and LoRA provide partial mitigation.
- Memory and computation: backpropagating through many unrolled synthetic-data steps scales linearly in the number of inner steps (each step contributes a Hessian-vector product under autodiff), but retaining the unrolled computation graph is VRAM-intensive for large models.
- Hyperparameter tuning: parameters such as the matched-window location (difficulty), the selected-versus-distilled split in SelMatch, inner-loop learning rates, and augmentation regimes require careful tuning to achieve the best results for different dataset sizes and complexities.
- Cross-architecture generalization: While trajectory matching induces model-agnostic functional shifts, certain approaches (e.g., ATT (Liu et al., 19 Jul 2024), WiDistill (Wang et al., 5 Oct 2024)) provide enhanced transferability.
Limitations persist regarding scalability to truly massive corpora (e.g., >100M samples or >100B-parameter models), as well as extension to structured-output tasks (detection, segmentation), and automatic online difficulty estimation (to choose trajectory matching phases without full grid search).
A plausible implication is that trajectory-matching frameworks—especially those integrating difficulty alignment, selection/partial freezing, and regularization for diversity—represent the most robust route to high-fidelity, lossless dataset distillation across a spectrum of modalities and real-world constraints. Further developments in efficient trajectory storage, adaptive objective scheduling, and generalization to structured or multi-modal output spaces remain active research directions.
6. References
- Cazenavette et al., “Dataset Distillation by Matching Training Trajectories” (2022).
- Guo et al., “Towards Lossless Dataset Distillation via Difficulty-Aligned Trajectory Matching” (2023).
- Lee et al., “SelMatch: Effectively Scaling Up Dataset Distillation via Selection-Based Initialization and Partial Updates by Trajectory Matching” (28 May 2024).
- Li et al., “Prioritize Alignment in Dataset Distillation” (6 Aug 2024).
- Li et al., “Contrastive Learning-Enhanced Trajectory Matching for Small-Scale Dataset Distillation” (21 May 2025).
- Zhong et al., “Towards Stable and Storage-efficient Dataset Distillation: Matching Convexified Trajectory” (28 Jun 2024).
- Liu et al., “Dataset Distillation by Automatic Training Trajectories” (19 Jul 2024).
- Yao et al., “Transferable text data distillation by trajectory matching” (14 Apr 2025).
- Wu et al., “Vision-Language Dataset Distillation” (2023).
- Dong et al., “High-Order Progressive Trajectory Matching for Medical Image Dataset Distillation” (29 Sep 2025).
- Yu et al., “Progressive trajectory matching for medical dataset distillation” (20 Mar 2024).