Trajectory Guided Dataset Distillation (TGDD)

Updated 10 December 2025

The paper introduces a trajectory matching framework that optimizes synthetic datasets to follow the multi-step learning dynamics of expert networks.
It leverages techniques such as contrastive objectives, dynamic regularization, and adaptive trajectory buffers to overcome limitations of traditional gradient matching.
Empirical results demonstrate that TGDD improves transferability, adversarial robustness, and performance across vision, language, and multimodal benchmarks.

Trajectory Guided Dataset Distillation (TGDD) refers to a class of dataset distillation methods that synthesize a small, highly informative synthetic dataset by optimizing it to match the long-range parameter evolution (the “trajectory”) of a model trained on the original, much larger dataset. In contrast to earlier gradient-matching or distribution-matching approaches, which align only one-step gradients or moment-matched features, TGDD aligns the optimization dynamics of models trained on synthetic and real data across substantial stretches of training. The aim is that a model trained from scratch on the synthetic data not only learns similar static representations but also follows a learning path similar to the one induced by real data, thereby preserving generalization. Recent variants extend this to adversarial robustness, cross-modal settings, and extreme sample efficiency.

1. Foundations and Trajectory Matching Principle

TGDD seeks to reproduce the multi-step training trajectory of a neural network on real data by optimizing synthetic data points so that, when a student model is trained on them for several steps from a reference initialization, its parameters closely track (in $\ell_2$ or another distance) those of an “expert” network trained on the full dataset. The core procedure is as follows (Cazenavette et al., 2022):

Train an expert network on the real dataset, saving snapshots $\{\theta_t^*\}_{t=0}^T$ to form a parameter trajectory.
At each distillation iteration:
- Sample an anchor epoch $t$.
- Initialize a student at $\hat\theta_t = \theta_t^*$.
- Train the student for $N$ steps on the synthetic set $\mathcal{S}$ with learning rate $\alpha$, possibly using differentiable augmentations.
- Compare the student's final parameters $\hat\theta_{t+N}$ to the expert’s parameters $M$ steps after the anchor, $\theta_{t+M}^*$ (typically, $M \gg N$ when both are counted in gradient updates).
- Minimize the normalized squared weight distance:
$$\mathcal{L} = \frac{\|\hat\theta_{t+N} - \theta_{t+M}^*\|_2^2}{\|\theta_t^* - \theta_{t+M}^*\|_2^2}$$
Update the synthetic data and/or learning rate by backpropagating through the student’s unrolled training steps.
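
The first part of this procedure, recording an expert trajectory and sampling an anchor segment from it, can be sketched as follows. This is a minimal illustration that assumes a toy linear-regression expert trained with full-batch gradient descent in NumPy. The function names `train_expert` and `sample_segment`, the `max_start` bound, and all hyperparameters are hypothetical and not taken from the cited papers.

```python
import numpy as np

def train_expert(X, y, num_epochs=20, lr=0.1, seed=0):
    """Train a linear model with full-batch GD and save one snapshot per epoch.

    Returns an array of shape (num_epochs + 1, d) holding theta*_0 ... theta*_T.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(scale=0.1, size=X.shape[1])
    snapshots = [theta.copy()]
    for _ in range(num_epochs):
        grad = X.T @ (X @ theta - y) / len(y)   # gradient of 0.5 * mean squared error
        theta = theta - lr * grad
        snapshots.append(theta.copy())
    return np.stack(snapshots)

def sample_segment(trajectory, M, max_start, rng):
    """Pick an anchor t <= max_start and return (t, theta*_t, theta*_{t+M})."""
    last_valid = min(max_start, len(trajectory) - 1 - M)
    t = int(rng.integers(0, last_valid + 1))    # upper bound is exclusive
    return t, trajectory[t], trajectory[t + M]

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])
trajectory = train_expert(X, y)
t, theta_start, theta_target = sample_segment(trajectory, M=2, max_start=10, rng=rng)
```

In practice the snapshots would be full network weights and several independently initialized experts would usually be stored, but the indexing logic stays the same: the student starts at `theta_start` and is scored against `theta_target`.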

TGDD’s trajectory-matching framework generalizes short-range gradient matching (DC/DSA) by focusing on multi-step, long-range optimization dynamics. This preserves not only one-step gradients but more global learning signals, empirically producing synthetic sets that generalize better, adapt across architectures, and remain robust as data grows in cardinality (Cazenavette et al., 2022, Guo et al., 2023).

2. Extensions and Algorithmic Variants

Several developments extend the vanilla trajectory-matching approach to address limitations related to stability, scalability, efficiency, and diversity:

Difficulty-Aligned Trajectory Matching (DATM): Aligns the portion of the expert trajectory used for matching with the cardinality of the synthetic set, measured in images per class (IPC): small sets are matched to early “easy” patterns, and late “hard” patterns are matched as the synthetic set grows. This enables “lossless” distillation at sufficiently high IPC, where training on the synthetic set performs on par with training on the real dataset, and closes the generalization gap that persists in low-IPC regimes (Guo et al., 2023).
Flat/Smoothed Trajectory Buffers: Regularizes the expert trajectories to promote smooth, flat parameter paths, leveraging losses such as gradient penalties and clipping. Smoother expert trajectories suppress the accumulation of initialization errors during student alignment, leading to better transfer and generalization (Du et al., 2022, Shen et al., 2023).
Convexified Trajectory Matching (MCT): Involves interpolating expert checkpoint parameters along the line segment (or path-length normalized curve) between initial and final model weights, providing a stable, noise-robust expert trajectory, reducing storage cost, and allowing for continuous-time matching (Zhong et al., 2024).
Automatic Training Trajectories (ATT): Rather than fixing the matching step size, dynamically selects at each iteration the depth of the student unroll that provides the closest match to the expert’s target, mitigating the Accumulated Mismatching Problem and improving stability/adaptivity (Liu et al., 2024).
Contrastive Learning-Enhanced Matching: Augments trajectory matching with SimCLR-style contrastive objectives, especially useful under extreme sample scarcity (e.g., IPC=1), to ensure that synthetic samples remain semantically well-separated in latent space (Li et al., 21 May 2025).
Progressive and Partial-Matching Schemes: Progressive trajectory extension and partial update strategies improve stability and prevent mode collapse in specialized domains (e.g., medical imaging or high IPC), often complemented by dynamic regularization to maintain diversity (Yu et al., 2024, Lee et al., 2024).
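
Two of the variants above lend themselves to short sketches. The code below is an illustration under simplifying assumptions, with parameters stored as flat NumPy vectors. `convex_trajectory` covers only the line-segment case described for MCT, and `closest_unroll_depth` covers only the selection idea described for ATT (pick the student step closest to the expert target). Both function names and the toy inputs are hypothetical, and the published methods include details that are omitted here.

```python
import numpy as np

def convex_trajectory(theta_init, theta_final, num_points):
    """Evenly spaced checkpoints on the segment from theta_init to theta_final."""
    betas = np.linspace(0.0, 1.0, num_points)[:, None]
    return (1.0 - betas) * theta_init + betas * theta_final

def closest_unroll_depth(student_states, theta_start, theta_target):
    """Return (depth, loss) for the student state nearest to the expert target.

    student_states[n] holds the student parameters after n + 1 synthetic steps.
    """
    denom = np.sum((theta_start - theta_target) ** 2)
    losses = np.sum((student_states - theta_target) ** 2, axis=1) / denom
    best = int(np.argmin(losses))
    return best + 1, float(losses[best])

theta_init = np.zeros(3)
theta_final = np.array([1.0, 2.0, 3.0])
path = convex_trajectory(theta_init, theta_final, num_points=5)

# Toy student that passes through the target on its third step and then overshoots.
student_states = np.stack([0.2 * theta_final, 0.4 * theta_final,
                           0.5 * theta_final, 0.9 * theta_final])
depth, loss = closest_unroll_depth(student_states, path[0], path[2])
```

With a fixed unroll length of four steps, the toy student above would be penalized for overshooting. Selecting the depth per iteration instead scores it at the step where it is actually closest to the target.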

3. Mathematical Objective and Pseudocode

The canonical TGDD loss is:

$$\mathcal{L} = \frac{\|\hat\theta_{t+N} - \theta_{t+M}^*\|_2^2}{\|\theta_t^* - \theta_{t+M}^*\|_2^2}$$

Here:

$\theta_t^*$: expert parameter at step $t$,
$N$: number of student GD steps on $\mathcal{S}$ (synthetic),
$M$: teacher trajectory length to match (real),
$\hat\theta_{t+N}$: final student weights after $N$ steps,
Normalization prevents degenerate solutions and balances contributions across trajectory segments.
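
A small numerical sketch can make the role of the normalization concrete. The function name `trajectory_matching_loss`, the `eps` safeguard against division by zero, and the toy vectors are assumptions made for illustration only.

```python
import numpy as np

def trajectory_matching_loss(theta_student, theta_start, theta_target, eps=1e-12):
    """Squared student-to-target distance divided by the expert's own displacement."""
    num = np.sum((theta_student - theta_target) ** 2)
    den = np.sum((theta_start - theta_target) ** 2)
    return num / (den + eps)   # eps is a numerical safeguard, not part of the loss

offset = np.array([0.1, 0.0])  # same absolute student error in both cases

# Early segment: the expert moves a long way between theta*_t and theta*_{t+M}.
early_start, early_target = np.array([0.0, 0.0]), np.array([1.0, 0.0])
# Late segment: the expert barely moves.
late_start, late_target = np.array([5.0, 5.0]), np.array([5.1, 5.0])

early_loss = trajectory_matching_loss(early_target + offset, early_start, early_target)  # about 0.01
late_loss = trajectory_matching_loss(late_target + offset, late_start, late_target)      # about 1.0

# A student that never leaves the anchor scores 1 on any segment.
idle_loss = trajectory_matching_loss(early_start, early_start, early_target)
```

The same absolute error is negligible on the early segment but as bad as not training at all on the late one. A loss of 1 therefore serves as a natural reference point: below it, the student ended closer to the expert target than where it started.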

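A representative distilled data update loop, as in (Cazenavette et al., 2022), repeats the following steps until the synthetic set converges:
- Sample an expert trajectory and an anchor step $t$, and set $\hat\theta_t = \theta_t^*$.
- For $n = 0, \dots, N-1$, take a gradient step on the synthetic set: $\hat\theta_{t+n+1} = \hat\theta_{t+n} - \alpha \nabla_\theta \ell(\mathcal{S}; \hat\theta_{t+n})$, where $\ell$ is the training loss.
- Compute $\mathcal{L}$ between $\hat\theta_{t+N}$ and $\theta_{t+M}^*$.
- Update $\mathcal{S}$ and $\alpha$ by gradient descent on $\mathcal{L}$.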
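
The loop can be sketched end to end on a toy problem. The version below is not the implementation from the cited work: it assumes a linear-regression model, keeps the synthetic inputs fixed, and learns only the synthetic targets, so that the derivative of the unrolled student with respect to the synthetic set has a simple closed form and no autodiff library is needed. Real implementations backpropagate through the unroll and also update the synthetic inputs and the learning rate $\alpha$. All names, sizes, and step sizes here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 8                  # parameter dimension, synthetic set size
N, M, alpha = 5, 10, 0.1     # student steps, expert steps, student learning rate

# Expert trajectory: full-batch GD on real data, one snapshot per step.
X_real = rng.normal(size=(512, d))
y_real = X_real @ rng.normal(size=d)
theta = np.zeros(d)
expert = [theta.copy()]
for _ in range(40):
    theta = theta - 0.05 * X_real.T @ (X_real @ theta - y_real) / len(y_real)
    expert.append(theta.copy())
expert = np.stack(expert)
num_anchors = len(expert) - M          # valid anchors t satisfy t + M <= 40

# Synthetic set: fixed random inputs, learnable targets (simplifying assumption).
X_syn = rng.normal(size=(k, d))
y_syn = np.zeros(k)
A = np.eye(d) - alpha * X_syn.T @ X_syn / k    # d(theta_{n+1}) / d(theta_n)
B = alpha * X_syn.T / k                         # direct d(theta_{n+1}) / d(y_syn)

def unroll(theta_start, targets):
    """N student GD steps on the synthetic set, tracking J = d(theta_hat)/d(targets)."""
    theta_hat, J = theta_start.copy(), np.zeros((d, k))
    for _ in range(N):
        theta_hat = theta_hat - alpha * X_syn.T @ (X_syn @ theta_hat - targets) / k
        J = A @ J + B
    return theta_hat, J

def segment_norm(t):
    """Squared expert displacement over the segment starting at anchor t."""
    return np.sum((expert[t] - expert[t + M]) ** 2)

def mean_loss(targets):
    """Average normalized matching loss over all anchors (monitoring only)."""
    return float(np.mean([np.sum((unroll(expert[t], targets)[0] - expert[t + M]) ** 2)
                          / segment_norm(t) for t in range(num_anchors)]))

# Step size chosen so that each update is stable for this quadratic toy problem.
J_const = unroll(expert[0], y_syn)[1]   # J is the same for every anchor in this model
min_norm = min(segment_norm(t) for t in range(num_anchors))
outer_lr = 0.5 * min_norm / np.linalg.norm(J_const, 2) ** 2

loss_before = mean_loss(y_syn)
for _ in range(2000):
    t = int(rng.integers(0, num_anchors))             # sample an anchor
    theta_hat, J = unroll(expert[t], y_syn)           # student starts at theta*_t
    residual = theta_hat - expert[t + M]
    grad_y = 2.0 * J.T @ residual / segment_norm(t)   # d(loss)/d(y_syn) through the unroll
    y_syn = y_syn - outer_lr * grad_y                 # update the synthetic targets
loss_after = mean_loss(y_syn)
```

Because the synthetic inputs stay fixed in this sketch, the loss cannot generally reach zero. The point is only to show the structure: anchor sampling, student unroll, normalized distance, and an update that differentiates through all $N$ student steps.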

Extensions add parallel loss terms (e.g., contrastive, maximum mean discrepancy for class diversity, overlap regularizers), or modify the buffer (e.g., convexified or smoothed trajectories).
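
As one illustration of such an auxiliary term, the sketch below computes a squared maximum mean discrepancy (MMD) between feature sets with a Gaussian kernel and adds it to a matching loss as a weighted sum. The weighted-sum form, the kernel bandwidth `gamma`, the weight, and the random stand-in features are all assumptions made here, not settings from the cited papers.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Gaussian kernel matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))   # clamp tiny negatives from rounding

def mmd_squared(X, Y, gamma=0.5):
    """Biased (V-statistic) estimate of squared MMD between samples X and Y."""
    return (rbf_kernel(X, X, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean()
            - 2.0 * rbf_kernel(X, Y, gamma).mean())

def total_loss(matching_loss, aux_loss, weight):
    """Trajectory-matching loss plus a weighted auxiliary term."""
    return matching_loss + weight * aux_loss

rng = np.random.default_rng(0)
real_feats = rng.normal(size=(200, 3))
close_syn = rng.normal(size=(20, 3))              # same distribution as the real features
shifted_syn = rng.normal(loc=2.0, size=(20, 3))   # shifted away from it

mmd_close = mmd_squared(real_feats, close_syn)
mmd_shifted = mmd_squared(real_feats, shifted_syn)
combined = total_loss(matching_loss=1.0, aux_loss=mmd_shifted, weight=0.1)
```

A synthetic set whose features drift away from the real feature distribution receives a larger penalty, which is the sense in which such a term can counteract collapse toward a few modes.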

4. Empirical Performance and Key Results

TGDD variants report state-of-the-art results across vision, language, and multimodal benchmarks:

On CIFAR-10/100 and TinyImageNet, baseline TGDD achieves substantial improvements over prior DC/DSA/DM approaches, especially at low IPC. For example, on CIFAR-10 at IPC=1, MTT reaches 46.2% accuracy, while contrastive-enhanced TGDD reaches 53.0% (Li et al., 21 May 2025).
DATM achieves “lossless” distillation (synthetic set performance indistinguishable from real data) at high IPC by aligning the synthetic set’s “difficulty” coverage to late-stage expert trajectories (Guo et al., 2023).
MCT improves stability and convergence speed, with lower memory and storage overhead, matching or exceeding the final accuracy of standard MTT (Zhong et al., 2024).
Flat trajectory regularization produces consistently higher cross-architecture transfer and faithfully preserves ranking for neural architecture search (Du et al., 2022).
Medical dataset distillation with progressive matching and dynamic overlap mitigation yields absolute accuracy gains of more than 8% over the previous state of the art under extreme data scarcity (IPC=2) (Yu et al., 2024).
TGDD for NLP, vision-language, and instruction tuning (using embedding-regularized prompt tokens) matches or outperforms the best coreset/data selection methods at high compression ratios while retaining transferability (Yao et al., 14 Apr 2025, Wu et al., 2023).

5. Extensions: Robustness, Diversity, and Modality

Recent TGDD research goes beyond clean accuracy to address robustness and cross-modality:

Adversarial Matching (MAT): Embeds adversarial robustness directly by modeling the expert buffer as an adversarially trained trajectory (e.g., via PGD), with or without Exponential Moving Average smoothing for stability. Training on distilled data produced by MAT yields substantially higher adversarial accuracy even without adversarial augmentation in the student, across several datasets and attack types (Lai et al., 15 Mar 2025).
Balanced Distribution TGDD: Dynamically aligns feature distributions and injects class-overlap penalties along the model trajectory, yielding synthetic sets that manifest both high semantic diversity and class compactness, boosting high-resolution performance even on large-scale ImageNet subsets (Ran et al., 2 Dec 2025).
Cross-modal and Language TGDD: Trajectory-matching for prompt embeddings, vision-language pairs, and even multi-modal tasks enables compressing training dynamics into extremely low-cardinality synthetic sets. Embedding regularization enables cross-architecture transfer, including LLMs and vision-language transformers (Yao et al., 14 Apr 2025, Wu et al., 2023).
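
The idea of an adversarially trained expert buffer can be illustrated on a toy model. The sketch below assumes a logistic-regression expert, an L-infinity PGD attack with hypothetical budget and step size, and an exponential moving average (EMA) over the weights. It shows only how such a buffer might be recorded and is not the procedure from the cited paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_attack(X, y, w, eps=0.1, step=0.025, num_steps=5):
    """L-infinity PGD against logistic regression with labels y in {0, 1}."""
    X_adv = X.copy()
    for _ in range(num_steps):
        grad_x = (sigmoid(X_adv @ w) - y)[:, None] * w[None, :]   # d(loss)/d(input)
        X_adv = X_adv + step * np.sign(grad_x)                    # ascend the loss
        X_adv = np.clip(X_adv, X - eps, X + eps)                  # project into the eps-ball
    return X_adv

def adversarial_expert(X, y, num_epochs=30, lr=0.5, ema_decay=0.9):
    """Adversarially train a logistic model; return raw and EMA-smoothed snapshots."""
    w = np.zeros(X.shape[1])
    ema = w.copy()
    raw, smoothed = [w.copy()], [ema.copy()]
    for _ in range(num_epochs):
        X_adv = pgd_attack(X, y, w)                               # attack the current weights
        grad_w = X_adv.T @ (sigmoid(X_adv @ w) - y) / len(y)
        w = w - lr * grad_w                                       # train on adversarial inputs
        ema = ema_decay * ema + (1.0 - ema_decay) * w
        raw.append(w.copy())
        smoothed.append(ema.copy())
    return np.stack(raw), np.stack(smoothed)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
raw_traj, ema_traj = adversarial_expert(X, y)
X_adv = pgd_attack(X, y, raw_traj[-1])
```

Either `raw_traj` or `ema_traj` could then play the role of the expert trajectory in the matching loss, so that the synthetic set is optimized to reproduce the dynamics of adversarial training rather than standard training.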

6. Limitations, Open Challenges, and Practical Considerations

Despite its empirical success and conceptual generality, TGDD faces several open challenges:

Expert Trajectory Storage: Standard trajectory matching (MTT, short for Matching Training Trajectories) incurs high storage and pre-computation costs. Convexified trajectory buffers (MCT) and LoRA-based matching (for transformer models) ameliorate but do not eliminate this bottleneck (Zhong et al., 2024, Wu et al., 2023).
Hyperparameter Sensitivity: Distillation performance can be sensitive to trajectory segment selection, step sizes, regularization coefficients, and augmentations; curriculum-style schedules (DATM) and automatic selection (ATT) help, but require careful tuning (Guo et al., 2023, Liu et al., 2024).
Cross-Architecture Fidelity: While ATT and flat/robust buffer variants improve cross-architecture transfer, strictly “lossless” distillation remains architecture-specific. Closing this gap is an active area of research (Guo et al., 2023, Liu et al., 2024).
Scalability to Very Large Models: Efficient unrolling, memory-efficient implicit differentiation, and modular buffer learning are essential when scaling TGDD to modern vision/language transformers in practical settings (Yao et al., 14 Apr 2025, Wu et al., 2023).
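
A back-of-the-envelope calculation shows why expert storage becomes a bottleneck. The numbers below (a one-million-parameter network, 100 experts, 50 epoch snapshots each, float32 weights) are hypothetical and chosen only for illustration. The endpoints-only case assumes that a line-segment interpolation needs just the initial and final weights of each expert.

```python
def buffer_size_gib(num_params, num_checkpoints, num_experts, bytes_per_param=4):
    """Storage for expert checkpoints in GiB (float32 by default)."""
    return num_params * num_checkpoints * num_experts * bytes_per_param / 2**30

# Full buffer: 50 epoch snapshots plus the initialization, for each of 100 experts.
full_buffer = buffer_size_gib(1_000_000, num_checkpoints=51, num_experts=100)   # roughly 19 GiB

# Endpoints only: initial and final weights of each expert.
endpoints_only = buffer_size_gib(1_000_000, num_checkpoints=2, num_experts=100)  # under 1 GiB

savings_factor = full_buffer / endpoints_only
```

The cost grows linearly in each factor, so moving from a small ConvNet to a transformer with hundreds of millions of parameters multiplies the full-buffer figure by the same ratio, which is why buffer compression and low-rank parameterizations matter at scale.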

TGDD establishes a systematic, trajectory-centric foundation for dataset distillation. Its methods not only compress data but also encode complex learning dynamics, including robustness, diversity, and transferable generalization, within highly compact synthetic datasets. Its continued evolution reflects both the practical demand for efficient data utilization and the theoretical pursuit of compressible, generalizable learning dynamics.