
TSTR: Train on Synthetic, Test on Real

Updated 19 January 2026
  • TSTR is an evaluation protocol where models are trained solely on synthetic data and then tested on real data to measure utility and domain discrepancies.
  • It is applied across various domains including images, time series, and tabular data, using metrics like mAP, AUROC, and R² to benchmark performance.
  • The approach emphasizes the importance of generator quality and diagnostic tools like CKA and SHAP for ensuring semantic fidelity and effective domain adaptation.

Train on Synthetic, Test on Real (TSTR) is a rigorous evaluation protocol for assessing the capacity of synthetic data to replace or augment real data in supervised learning. In the TSTR setting, a model is wholly trained on synthetic data and then evaluated on a real, held-out test set. This protocol enables precise quantification of the domain gap and practical utility of synthetic data for downstream tasks. TSTR has been systematically analyzed across a variety of modalities (images, time series, tabular data, multimodal settings), tasks (object detection, segmentation, classification, temporal prediction), and with different synthetic data generation frameworks including GANs, VAEs, diffusion models, and simulation pipelines.

1. Core Definition and Mathematical Formalism

Let $S_{\mathrm{syn}}$ be a synthetic dataset (e.g., images $x'_j$ with labels $y'_j$) and $T_{\mathrm{real}}$ a real test set of pairs $(x_i, y_i)$. Under TSTR, a predictor $f_\theta$ (parameterized by $\theta$) is trained on $S_{\mathrm{syn}}$:

$$\theta^* = \arg\min_\theta \frac{1}{|S_{\mathrm{syn}}|} \sum_{(x', y') \in S_{\mathrm{syn}}} \mathcal{L}(f_\theta(x'), y')$$

and then evaluated on $T_{\mathrm{real}}$:

$$\mathrm{TSTR} = \frac{1}{|T_{\mathrm{real}}|} \sum_{(x, y) \in T_{\mathrm{real}}} \mathbf{1}[f_\theta(x) = y]$$

For regression tasks, the metric may be RMSE, $R^2$, or another task-appropriate loss. TSTR performance is typically compared against a train-on-real, test-on-real (TRTR) baseline, and the ratio or gap between the two is analyzed to quantify the synthetic data's utility (Ibrahim et al., 22 Oct 2025, Ljungqvist et al., 2023, Yu et al., 17 Nov 2025, Koochali et al., 2022, Murad et al., 4 Aug 2025, Ruis et al., 2024, Le et al., 2023, Hoffmann et al., 2019, Esteban et al., 2017).
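
The protocol above can be sketched end to end with a toy setup. This is a minimal, illustrative sketch only: the nearest-centroid classifier stands in for $f_\theta$, and the shifted Gaussian "synthetic" set stands in for an imperfect generator with a small domain gap.

```python
import numpy as np

def fit_centroids(X, y):
    """Train a nearest-centroid classifier (a stand-in for any model f_theta)."""
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def predict(model, X):
    classes, centroids = model
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d, axis=1)]

def tstr_score(X_train, y_train, X_real_test, y_real_test):
    """Train on the given (possibly synthetic) set, score on the real test set."""
    model = fit_centroids(X_train, y_train)
    return float((predict(model, X_real_test) == y_real_test).mean())

# Toy data: two Gaussian classes; "synthetic" is a slightly shifted copy of "real".
rng = np.random.default_rng(0)
def make(n, shift=0.0):
    X0 = rng.normal(loc=-1 + shift, size=(n, 2))
    X1 = rng.normal(loc=+1 + shift, size=(n, 2))
    return np.vstack([X0, X1]), np.array([0] * n + [1] * n)

X_real_tr, y_real_tr = make(500)
X_real_te, y_real_te = make(500)
X_syn, y_syn = make(500, shift=0.3)   # imperfect generator: small domain gap

trtr = tstr_score(X_real_tr, y_real_tr, X_real_te, y_real_te)  # real -> real baseline
tstr = tstr_score(X_syn, y_syn, X_real_te, y_real_te)          # synthetic -> real
print(f"TRTR={trtr:.3f}  TSTR={tstr:.3f}  retention={tstr / trtr:.2%}")
```

The retention ratio printed at the end corresponds to the TSTR-versus-TRTR comparison used throughout the surveyed studies.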

2. Use Cases and Empirical Findings Across Modalities

TSTR has been employed in diverse research contexts:

  • Object Detection and Segmentation: Synthetic datasets such as GTAV or procedurally generated CAD scenes allow training detectors (YOLOv3, Mask R-CNN, Faster R-CNN, Swin Transformer) and evaluating generalization to real benchmarks like BDD100K, Cityscapes, or WISDOM-Real (Ljungqvist et al., 2023, Ruis et al., 2024, Danielczuk et al., 2018, Shen et al., 2023).
  • Satellite/Landcover Segmentation: Mask-conditional GANs with SPADE blocks produce synthetic satellite imagery to train U-Nets for per-pixel landcover, matching or surpassing real-trained models when synthetic and real are mixed (Le et al., 2023).
  • Time Series (Medical, Forecasting): RGAN/RCGAN, TimeDiff, and VAE+diffusion models generate synthetic ICU/aviation time series. TSTR shows that state-of-the-art generators, particularly transformer-based and diffusion-based ones, can recover 94–97% of real-data utility in flight and ICU prediction tasks, with retention gaps of $\Delta_{\mathrm{TSTR}} \approx 0.01$ AUROC on MIMIC-III/eICU and up to 97% $R^2$ retention in aviation forecasting (Esteban et al., 2017, Ibrahim et al., 22 Oct 2025, Murad et al., 4 Aug 2025).
  • Multimodal Relation Extraction: TSTR can be used when only one modality is present at train time. Methods such as MI²RAGE leverage chained cross-modal generation and mutual-information-based filtering to construct synthetic multimodal data, achieving or surpassing state-of-the-art performance compared to real multimodal training (Du et al., 2023).
  • Tabular Data: In tabular domains, TSTR (often classification accuracy or AUROC) is the default utility metric for synthetic data in sensitive settings (healthcare, enterprise) (Yu et al., 17 Nov 2025, Murad et al., 4 Aug 2025).

3. Empirical Performance and Critical Phenomena

Observed TSTR performance exposes the domain adaptation gap and the efficacy of different strategies:

| Domain / Task | Real→Real Baseline | TSTR (Synthetic→Real) | Retention / Gain | Reference |
|---|---|---|---|---|
| YOLOv3 on BDD (object det.) | 0.43 mAP | 0.12 mAP | $\Delta_{\mathrm{TSTR}} \approx 0.31$ | (Ljungqvist et al., 2023) |
| U-Net (sat. landcover, mIoU) | 0.52 | 0.41–0.42 | ~80% of real; mixing: 0.58 | (Le et al., 2023) |
| ICU Mortality (AUROC) | 0.88 | 0.87–0.88 | $\Delta_{\mathrm{TSTR}} \le 0.01$ | (Ibrahim et al., 22 Oct 2025) |
| Aviation Forecast ($R^2$) | 0.44 | 0.43 | 98% retention | (Murad et al., 4 Aug 2025) |
| ImageNet1K, ResNet-50 (Top-1) | 79.6% | 70.9–76.0% | 89–95% (at $>6\times$ synth) | (Yuan et al., 2023) |

Synthetic data can match or surpass real-only training when synthetic diversity is increased, generators improve, and domain translation or augmentation is applied carefully. However, pronounced failures occur when high-level scene statistics, class-conditional cues, or domain-specific textures are not aligned: e.g., the domain gap in detector heads (Ljungqvist et al., 2023), mode collapse (Koochali et al., 2022), or overfitting to synthetic artifacts (Hoffmann et al., 2019). Mixing synthetic data with even a small amount of real data consistently boosts utility (Le et al., 2023, Lee et al., 2024).
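
The gap and retention columns in the table are simple derived quantities; a small helper makes the convention explicit (row values taken from the YOLOv3/BDD entry for illustration):

```python
def tstr_gap(trtr: float, tstr: float) -> tuple[float, float]:
    """Absolute gap (TRTR - TSTR) and retention ratio (TSTR / TRTR)."""
    return trtr - tstr, tstr / trtr

# YOLOv3 on BDD: large absolute gap, low retention.
gap, retention = tstr_gap(0.43, 0.12)
print(f"gap={gap:.2f}  retention={retention:.0%}")
```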

4. Diagnostic Analyses and Layerwise Similarity

TSTR provides a practical test of end-to-end generalization, but various papers have supplemented this with more diagnostic metrics:

  • CKA (Centered Kernel Alignment): Layerwise CKA similarity between real-trained and synthetic-trained models reveals that low-level features are robust (CKA $> 0.9$ in early layers), but mid- and high-level features (detection head, scene semantics) diverge sharply (CKA as low as 0.05–0.2 in the detection head), and this divergence correlates with the mAP drop (Ljungqvist et al., 2023).
  • Feature Distance Metrics: Measures such as the train2test Mahalanobis distance and $\mathrm{AP}_{\mathrm{t2t}}$ expose how well synthetic-augmented pools cover the real test feature space and relate directly to AP; lower distances predict higher detection accuracy (Lee et al., 2024).
  • SHAP Attribution Distance: In tabular settings, the SHAP Distance reveals that TSTR accuracy can remain high even when the model’s semantic reasoning patterns diverge from real-trained models, thus TSTR should be viewed as a minimal utility criterion rather than a guarantee of semantic fidelity (Yu et al., 17 Nov 2025).
  • Ablation and Sampling Strategies: Adversarial (student–teacher) sampling, mutual-information filtering, and automated curriculum methods improve TSTR by focusing generation or training on hard or informative synthetic samples, mitigating overfitting and domain collapse (Du et al., 2023, Hoffmann et al., 2019, Kerim et al., 2024).
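
As a concrete reference for the first diagnostic above, linear CKA (the kernel-free variant) reduces to a few lines of numpy. The feature matrices below are random stand-ins for layer activations, not outputs of any actual model:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape (n_samples, n_features).
    Returns 1.0 for identical representations (up to rotation/scaling),
    values near 0 for unrelated ones."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(1)
feats_real = rng.normal(size=(200, 64))       # stand-in: real-trained layer activations
feats_syn_early = 0.9 * feats_real + 0.05 * rng.normal(size=(200, 64))  # early layer: similar
feats_syn_head = rng.normal(size=(200, 64))   # detection head: diverged

print(linear_cka(feats_real, feats_syn_early))  # close to 1
print(linear_cka(feats_real, feats_syn_head))   # much lower
```

Comparing CKA layer by layer, as in Ljungqvist et al. (2023), localizes where the synthetic-trained network departs from its real-trained counterpart.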

5. Methodological Innovations and Recommendations

Multiple strategies recur across the surveyed studies for improving TSTR outcomes:

  • Mixing synthetic data with even a small fraction of real data (Le et al., 2023, Lee et al., 2024).
  • Adversarial (student–teacher) sampling and mutual-information filtering to concentrate training on hard or informative synthetic samples (Du et al., 2023, Hoffmann et al., 2019).
  • Distribution-matching objectives such as MMD or consistency penalties during generation (Yuan et al., 2023, Ibrahim et al., 22 Oct 2025).
  • Scaling up the synthetic set, which yields linear-to-saturating gains (Yuan et al., 2023).

6. TSTR Limitations, Analytical Insights, and Future Work

TSTR is a powerful, practical measure but not a comprehensive guarantee:

  • Blind Spots: TSTR will not reveal if synthetic data reproduces incorrect label reasoning, has semantic gaps, or over-smooths rare/corner cases; attribution-based checks or task-specific diagnostics are required (Yu et al., 17 Nov 2025, Koochali et al., 2022).
  • Task Specificity: TSTR utility is tightly linked to the supervised task and may not generalize to new objectives or transfer learning regimes. Precision/recall/F1, calibration, and robustness metrics are essential to supplement mAP or accuracy (Ljungqvist et al., 2023, Yu et al., 17 Nov 2025).
  • Assumptions: Many TSTR pipelines assume that $P_{\mathrm{syn}}(y \mid x) \approx P_{\mathrm{real}}(y \mid x)$, i.e., that synthetic and real domains share label semantics. If this fails, the TSTR metric conflates distribution and annotation artifacts.
  • Generator Quality is Pivotal: The best-performing frameworks for TSTR use deep distribution-matching objectives (MMD, guidance, attribute-driven prompting, mutual-information selection) and, given sufficient diversity and scale, outperform GANs or copula models (Yuan et al., 2023, Ibrahim et al., 22 Oct 2025, Kerim et al., 2024, Murad et al., 4 Aug 2025). Scaling up synthetic set cardinality partially compensates for limited synthetic fidelity, with observed linear-to-saturating gains (Yuan et al., 2023).
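
As an illustration of the distribution-matching objectives mentioned above, a (biased) RBF-kernel MMD estimate between synthetic and real samples is a few lines of numpy; the data and bandwidth below are illustrative, not from any cited study:

```python
import numpy as np

def mmd_rbf(X, Y, sigma=1.0):
    """Biased squared-MMD estimate with an RBF kernel: a distribution-matching
    score between two sample sets (lower = better match)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(2)
real = rng.normal(size=(300, 4))
good = rng.normal(size=(300, 4))           # same distribution -> MMD near 0
bad = rng.normal(loc=1.0, size=(300, 4))   # shifted "generator" -> larger MMD

m_good = mmd_rbf(real, good)
m_bad = mmd_rbf(real, bad)
print(f"matched generator: {m_good:.4f}  shifted generator: {m_bad:.4f}")
```

Used as a training penalty or a model-selection score, such an estimate directly targets the distributional match that the TSTR results reward.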

Future directions include curriculum-based or online TSTR, active domain adaptation during deployment, further extension to more complex or multimodal tasks, automated generator evaluation and selection, and formal privacy overlays for sensitive domains (Ibrahim et al., 22 Oct 2025, Du et al., 2023, Kerim et al., 2024).

7. Summary Table: TSTR Evaluation Landscape

| Study / Domain | Generator / Methodology | TSTR Metric / Result | Recommendations / Limitations |
|---|---|---|---|
| Object Detection (Ljungqvist et al., 2023) | YOLOv3 on GTAV/BDD | mAP 0.12 (vs 0.43 real) | Fine-tune middle layers; diversify cues |
| Satellite Segmentation (Le et al., 2023) | Mask-cond. SPADE-GAN | mIoU 0.41–0.42 | Mix 50–50 real/synth for best mIoU |
| Tabular, Aviation (Murad et al., 4 Aug 2025) | REaLTabFormer, TabSyn | $R^2$ retention 94–97% | Use transformer autoregressive models |
| ICU Time Series (Ibrahim et al., 22 Oct 2025) | Enhanced TimeAutoDiff | $\Delta_{\mathrm{TSTR}} \le 0.01$ AUROC | Add MMD/consistency penalties |
| ImageNet Classification (Yuan et al., 2023) | Diffusion + MMD + CLIP | Top-1 70.9–76.0% | Scale synthetic data; combine with guidance |
| Multimodal Extraction (Du et al., 2023) | MI²RAGE, CCG + MI filter | F1 92.8 (surpasses real) | Diversity and high MI essential |
| 2D Pose (Hoffmann et al., 2019) | Synthetic humans + augmentation | mAP 13.4 (synthetic) | Teacher–student; focus on hard samples |

TSTR is now a foundational evaluation for learning with synthetic data, offering actionable diagnosis of the synthetic–real gap and providing clear guidance for synthetic data generation and task-specific transfer. Best practices mandate pairing TSTR with diversity/attribution-based diagnostics and optimizing both distributional match and label fidelity for target deployments.
