Geometric Characterisation and Structured Trajectory Surrogates for Clinical Dataset Condensation

Published 23 Apr 2026 in cs.LG | (2604.21638v1)

Abstract: Dataset condensation constructs compact synthetic datasets that retain the training utility of large real-world datasets, enabling efficient model development and potentially supporting downstream research in governed domains such as healthcare. Trajectory matching (TM) is a widely used condensation approach that supervises synthetic data using changes in model parameters observed during training on real data, yet the structure of this supervision signal remains poorly understood. In this paper, we provide a geometric characterisation of trajectory matching, showing that a fixed synthetic dataset can only reproduce a limited span of such training-induced parameter changes. When the resulting supervision signal is spectrally broad, this creates a conditional representability bottleneck. Motivated by this mismatch, we propose Bezier Trajectory Matching (BTM), which replaces SGD trajectories with quadratic Bezier trajectory surrogates between initial and final model states. These surrogates are optimised to reduce average loss along the path while replacing broad SGD-derived supervision with a more structured, lower-rank signal that is better aligned with the optimisation constraints of a fixed synthetic dataset, and they substantially reduce trajectory storage. Experiments on five clinical datasets demonstrate that BTM consistently matches or improves upon standard trajectory matching, with the largest gains in low-prevalence and low-synthetic-budget settings. These results indicate that effective trajectory matching depends on structuring the supervision signal rather than reproducing stochastic optimisation paths.

Summary

  • The paper's main contribution is introducing BTM, a quadratic Bézier surrogate that replaces noisy SGD trajectories in clinical dataset condensation.
  • It reveals a representability bottleneck in traditional trajectory matching by demonstrating the constraints imposed by the effective gradient span.
  • Empirical results across clinical datasets show that BTM improves predictive performance and cross-architecture robustness, and reduces trajectory storage by up to 33×.

Introduction and Motivation

The challenge of dataset condensation (DC) is critical in domains with substantial governance constraints and operational bottlenecks, such as healthcare, where the storage, transfer, and sharing of large-scale datasets is problematic. In such settings, DC constructs compact synthetic datasets that preserve the training utility of the original data. Trajectory Matching (TM), a prevalent DC approach, supervises synthetic data via model parameter changes observed during real-data training. This paper rigorously analyses the geometric structure underlying TM, formulates its inherent representability bottleneck, and introduces Bézier Trajectory Matching (BTM)—a principled surrogate that replaces stochastic trajectories with low-complexity quadratic Bézier curves.

Geometric Analysis of Trajectory Matching

TM operates by attempting to reproduce real-data training trajectories, formalised as the sequence of teacher parameter displacements $\{\boldsymbol{\Delta}_k\}$. For any fixed synthetic dataset, the student’s reachable gradient span is limited to $\mathcal{G}_k$, determined by the synthetic data and inner-loop optimizer. The paper proves that TM is essentially a constrained subspace approximation problem: the residual $\mathcal{L}_{\mathrm{TM}}$ is lower-bounded by the norm of supervision components outside $\mathcal{G}_k$. When teacher supervision is spectrally broad, substantial mass lies outside this span, yielding a rank-based bottleneck that prevents the student from reproducing high-rank or diffuse trajectories (Figure 1).

Figure 1: Comparison between raw SGD teacher trajectories and Bézier trajectory surrogates; Bézier curves effect a smoother, low-complexity connection and induce a more stable optimisation profile for TM.

The analysis demonstrates that the representability bottleneck is conditional on the effective dimension of the reachable span; as teacher trajectories become more variable, matching becomes strictly more difficult.
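The subspace view above can be illustrated with a small NumPy sketch. It is not the paper's implementation: it assumes an unnormalised Euclidean residual, and the names `residual_lower_bound`, `basis`, `delta_inside` and `delta_broad` are hypothetical. A random matrix stands in for the reachable span $\mathcal{G}_k$; the sketch shows that the component of a teacher displacement orthogonal to the span is a floor that no in-span student update can go below.

```python
import numpy as np

def residual_lower_bound(delta, basis):
    """Norm of the part of `delta` lying outside span(columns of `basis`)."""
    # Orthonormalise the spanning set so the projection is q @ q.T @ delta.
    q, _ = np.linalg.qr(basis)
    projection = q @ (q.T @ delta)
    return np.linalg.norm(delta - projection)

rng = np.random.default_rng(0)
dim, span_rank = 50, 5

# Hypothetical reachable span: 5 random directions in a 50-dim parameter space.
basis = rng.standard_normal((dim, span_rank))

# A teacher displacement inside the span can be reproduced exactly.
delta_inside = basis @ rng.standard_normal(span_rank)
# A diffuse (isotropic) displacement mostly lies outside the span.
delta_broad = rng.standard_normal(dim)

bound_inside = residual_lower_bound(delta_inside, basis)
bound_broad = residual_lower_bound(delta_broad, basis)

# Any student update restricted to the span cannot beat the bound.
student_update = basis @ rng.standard_normal(span_rank)
achieved = np.linalg.norm(delta_broad - student_update)

print(f"in-span target, floor:   {bound_inside:.2e}")
print(f"diffuse target, floor:   {bound_broad:.3f}")
print(f"arbitrary in-span guess: {achieved:.3f}")
```

The floor grows as more of the target's energy falls outside the span, which is the sense in which the bottleneck is conditional on the span's effective dimension.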

Bézier Trajectory Matching: Structured Surrogates

To address the spectral mismatch, BTM replaces teacher trajectories with quadratic Bézier surrogates between model initialisation and solution. Each surrogate is defined by a single control point optimised to minimise average path loss. The paper proves that all segment displacements between endpoints of a quadratic Bézier curve belong to a two-dimensional subspace, critically reducing the effective rank of supervision. This low-dimensional family suppresses diffuse, weakly reinforced directions inherently present in SGD trajectories.
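The two-dimensional claim follows from the standard quadratic Bézier form $B(t) = (1-t)^2\theta_0 + 2(1-t)t\,c + t^2\theta_1$, whose derivative $B'(t) = 2(1-t)(c-\theta_0) + 2t(\theta_1-c)$ always lies in the span of $c-\theta_0$ and $\theta_1-c$. The following sketch checks this numerically; the dimensions, the random endpoints and the Gaussian random walk used as a stand-in for noisy SGD displacements are all illustrative assumptions, not the paper's setup.

```python
import numpy as np

def bezier_points(theta0, control, theta1, num_points):
    """Sample a quadratic Bezier curve at evenly spaced values of t in [0, 1]."""
    t = np.linspace(0.0, 1.0, num_points)[:, None]
    return (1 - t) ** 2 * theta0 + 2 * (1 - t) * t * control + t ** 2 * theta1

def top_k_energy(displacements, k):
    """Fraction of squared singular-value mass carried by the top-k directions."""
    s = np.linalg.svd(displacements, compute_uv=False)
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

rng = np.random.default_rng(0)
dim, num_points = 100, 21

# Hypothetical initial state, control point and final state.
theta0, control, theta1 = rng.standard_normal((3, dim))

# Segment displacements between consecutive points on the curve.
bezier_disp = np.diff(bezier_points(theta0, control, theta1, num_points), axis=0)
bezier_rank = np.linalg.matrix_rank(bezier_disp, tol=1e-8)
bezier_energy = top_k_energy(bezier_disp, 2)

# Stand-in for diffuse SGD displacements: independent Gaussian steps.
walk_disp = rng.standard_normal((num_points - 1, dim))
walk_energy = top_k_energy(walk_disp, 2)

print(f"Bezier displacement rank: {bezier_rank}")
print(f"top-2 energy, Bezier:      {bezier_energy:.6f}")
print(f"top-2 energy, random walk: {walk_energy:.3f}")
```

The same `top_k_energy` helper gives a simple way to compare how concentrated different supervision signals are.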

Spectral analysis across multiple clinical datasets shows that BTM surrogates concentrate displacement energy far more strongly than either raw SGD or convexified trajectories. The control point optimisation is target-aware: rather than smoothing by interpolation, it minimises the average loss along the trajectory, prioritising functionally meaningful, task-aligned progress.
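To make the control point optimisation concrete, here is a toy sketch under stated assumptions. The loss is a hypothetical two-dimensional function (a Gaussian bump plus a weak quadratic bowl) rather than a model's training loss, the path average is approximated by uniform samples of $t$, and plain gradient descent is used; none of these choices come from the paper. It shows a control point moving away from linear interpolation so that the curve bends around a high-loss region.

```python
import numpy as np

BUMP_CENTRE = np.array([0.0, 0.2])  # hypothetical high-loss region
BUMP_WIDTH = 0.5

def toy_loss_and_grad(theta):
    """Toy loss for points of shape (n, 2): Gaussian bump plus a weak bowl."""
    offset = theta - BUMP_CENTRE
    bump = np.exp(-np.sum(offset ** 2, axis=-1, keepdims=True) / (2 * BUMP_WIDTH ** 2))
    loss = bump[..., 0] + 0.1 * np.sum(theta ** 2, axis=-1)
    grad = -bump * offset / BUMP_WIDTH ** 2 + 0.2 * theta
    return loss, grad

def bezier(theta0, control, theta1, t):
    t = t[:, None]
    return (1 - t) ** 2 * theta0 + 2 * (1 - t) * t * control + t ** 2 * theta1

def average_path_loss(theta0, control, theta1, t):
    loss, _ = toy_loss_and_grad(bezier(theta0, control, theta1, t))
    return float(loss.mean())

def fit_control_point(theta0, theta1, steps=200, lr=0.5, num_samples=33):
    """Gradient descent on the sampled average path loss w.r.t. the control point."""
    t = np.linspace(0.0, 1.0, num_samples)
    control = 0.5 * (theta0 + theta1)    # start from linear interpolation
    weight = (2 * (1 - t) * t)[:, None]  # dB(t)/d(control), from the Bezier form
    for _ in range(steps):
        _, grad = toy_loss_and_grad(bezier(theta0, control, theta1, t))
        control = control - lr * (weight * grad).mean(axis=0)
    return control

theta0 = np.array([-1.0, 0.0])
theta1 = np.array([1.0, 0.0])
t_eval = np.linspace(0.0, 1.0, 33)
midpoint = 0.5 * (theta0 + theta1)

fitted = fit_control_point(theta0, theta1)
linear_loss = average_path_loss(theta0, midpoint, theta1, t_eval)
fitted_loss = average_path_loss(theta0, fitted, theta1, t_eval)

print(f"average loss, linear path: {linear_loss:.4f}")
print(f"average loss, fitted path: {fitted_loss:.4f}")
print(f"fitted control point:      {fitted}")
```

Only the control point is updated; both endpoints stay fixed, so the surrogate still connects the same initial and final states.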

Figure 2: BTM achieves superior cross-architecture generalisation under IPC=500, outperforming DATM especially under larger architecture shifts.

Experimental Evaluation

BTM is evaluated across five clinical datasets spanning tabular and time-series modalities (Oxford, Portsmouth, Birmingham NHS ED cohorts; eICU; MIMIC-III). The experiments focus on low-prevalence and low-budget regimes, as these settings induce spectrally broad and weakly aligned supervision signals. Across all tasks, BTM consistently matches or outperforms strong trajectory-based baselines such as DATM and MCT (convexified trajectory matching), with improvements most pronounced in AUPRC—a metric critical under label imbalance.

Large gains are observed in low-prevalence settings (e.g., Birmingham NHS cohort, 0.8% prevalence: 15.5%-17.9% AUPRC improvement over baselines), suggesting that rank-reduced, task-optimised surrogates are useful in difficult clinical prediction contexts. BTM also demonstrates robust cross-architecture transferability; condensed datasets can be used to train models with different architectural inductive biases while retaining most of their predictive utility.

Figure 3: Bézier surrogates reduce storage requirements by up to $33\times$ in clinical settings compared to SGD trajectories.

Storage and Complexity Ablations

The storage efficiency of BTM is substantial: each trajectory is represented by three parameter vectors, independent of the original number of SGD checkpoints, yielding $20\times$ to $33\times$ reductions in memory footprint across datasets (Figure 3). Ablation studies further establish that quadratic Bézier surrogates outperform linear and convexified alternatives across most settings for AUPRC, confirming the benefit of task-optimised curvature (Figure 4).
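The storage arithmetic can be sketched as follows, assuming each stored SGD checkpoint is one full parameter vector of the same size as each of the three Bézier vectors. The checkpoint counts of 60 and 99 are hypothetical values chosen only because they reproduce the reported range; the summary does not state the actual counts.

```python
def storage_reduction(num_checkpoints, vectors_per_surrogate=3):
    """Ratio of stored parameter vectors: SGD checkpoints vs. Bezier surrogate."""
    return num_checkpoints / vectors_per_surrogate

# Hypothetical checkpoint counts (assumptions, not taken from the paper).
reduction_60 = storage_reduction(60)
reduction_99 = storage_reduction(99)

print(f"60 checkpoints -> {reduction_60:.0f}x fewer vectors")
print(f"99 checkpoints -> {reduction_99:.0f}x fewer vectors")
```

Because the surrogate size is fixed at three vectors, the saving under this assumption scales linearly with the number of checkpoints the teacher trajectory would otherwise store.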

Figure 4: Quadratic Bézier trajectories consistently outperform linear and convexified alternatives across five clinical datasets for AUPRC (IPC=50, 500).

Theoretical and Practical Implications

The geometric bottleneck identified in TM motivates the structuring, rather than replication, of supervision signals for dataset condensation. BTM’s low-rank surrogate family aligns the trajectory signal with the constraints of condensed data optimisation, filtering stochastic noise while preserving dominant, functionally meaningful progress. The framework should generalise to other condensation domains where scalability and governance are priorities, and it decouples supervision from architecture-specific optimisation.

Empirically, BTM’s robust performance, storage reductions, and architecture-agnostic utility suggest it is a practical option in healthcare and federated learning contexts. However, it does not provide formal privacy guarantees, so it should be combined with differential privacy when deployed in governed domains.

Future Directions

Advancing BTM entails expanding surrogate families to richer adaptive curves and integrating formal privacy guarantees such as differentially private SGD. Combining trajectory structuring with fairness analyses and external validation is critical for responsible clinical deployment. The geometric framework may inform meta-learning, coreset selection, and generative distillation approaches, further accelerating DC in resource-constrained environments.

Conclusion

This paper provides a geometric characterisation of trajectory matching for dataset condensation and establishes a conditional representability bottleneck that limits the efficacy of high-rank supervision. Bézier Trajectory Matching (BTM) remedies this with structured, low-rank surrogates, resulting in improved downstream prediction, stronger cross-architecture robustness, and significant storage savings. The results have broad significance for DC in healthcare, supporting efficient model development under governance and operational constraints. Future work should focus on extending surrogate trajectory families and integrating privacy-preserving mechanisms for practical, scalable deployment (2604.21638).
