TSTR: Train on Synthetic, Test on Real
- TSTR is an evaluation protocol where models are trained solely on synthetic data and then tested on real data to measure utility and domain discrepancies.
- It is applied across various domains including images, time series, and tabular data, using metrics like mAP, AUROC, and R² to benchmark performance.
- The approach emphasizes the importance of generator quality and diagnostic tools like CKA and SHAP for ensuring semantic fidelity and effective domain adaptation.
Train on Synthetic, Test on Real (TSTR) is a rigorous evaluation protocol for assessing the capacity of synthetic data to replace or augment real data in supervised learning. In the TSTR setting, a model is wholly trained on synthetic data and then evaluated on a real, held-out test set. This protocol enables precise quantification of the domain gap and practical utility of synthetic data for downstream tasks. TSTR has been systematically analyzed across a variety of modalities (images, time series, tabular data, multimodal settings), tasks (object detection, segmentation, classification, temporal prediction), and with different synthetic data generation frameworks including GANs, VAEs, diffusion models, and simulation pipelines.
1. Core Definition and Mathematical Formalism
Let $\mathcal{D}_{\text{syn}} = \{(x_i^{\text{syn}}, y_i^{\text{syn}})\}_{i=1}^{N}$ be a synthetic dataset (e.g., images $x_i^{\text{syn}}$ with labels $y_i^{\text{syn}}$), and $\mathcal{D}_{\text{real}}^{\text{test}} = \{(x_j, y_j)\}_{j=1}^{M}$ a real test set. Under TSTR, a predictor $f_\theta$ (parameterized by $\theta$) is trained on $\mathcal{D}_{\text{syn}}$: $$\hat\theta = \arg\min_\theta \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_\theta(x_i^{\text{syn}}), y_i^{\text{syn}}\big),$$ and then evaluated on $\mathcal{D}_{\text{real}}^{\text{test}}$: $$\text{TSTR} = \frac{1}{M} \sum_{j=1}^{M} m\big(f_{\hat\theta}(x_j), y_j\big),$$ where $m$ is a task-appropriate metric (e.g., accuracy, AUROC, mAP). For regression tasks, the metric may be RMSE, $R^2$, or other task-appropriate losses. TSTR performance is typically compared with a train-on-real, test-on-real (TRTR) baseline, and the ratio or gap between the two is analyzed to quantify the synthetic data’s utility (Ibrahim et al., 22 Oct 2025, Ljungqvist et al., 2023, Yu et al., 17 Nov 2025, Koochali et al., 2022, Murad et al., 4 Aug 2025, Ruis et al., 2024, Le et al., 2023, Hoffmann et al., 2019, Esteban et al., 2017).
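The protocol itself is simple to operationalize. Below is a minimal sketch using scikit-learn and toy Gaussian data standing in for a generator's output; the names, the label-noise model of generator error, and all numbers are illustrative, not taken from any cited pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_data(n, label_noise=0.0):
    """Toy binary-classification data; `label_noise` mimics generator error."""
    X = rng.normal(size=(n, 5))
    w = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
    y = (X @ w > 0).astype(int)
    flip = rng.random(n) < label_noise          # imperfect synthetic labels
    return X, np.where(flip, 1 - y, y)

X_syn, y_syn = make_data(2000, label_noise=0.1)  # stand-in for synthetic data
X_tr, y_tr = make_data(2000)                     # real training split
X_te, y_te = make_data(1000)                     # held-out real test set

def auroc(model, X, y, X_eval, y_eval):
    """Fit on (X, y), score AUROC on the held-out real test set."""
    return roc_auc_score(y_eval, model.fit(X, y).predict_proba(X_eval)[:, 1])

trtr = auroc(LogisticRegression(), X_tr, y_tr, X_te, y_te)    # real -> real
tstr = auroc(LogisticRegression(), X_syn, y_syn, X_te, y_te)  # synthetic -> real

print(f"TRTR AUROC: {trtr:.3f}  TSTR AUROC: {tstr:.3f}  retention: {tstr/trtr:.1%}")
```

The retention ratio `tstr / trtr` is the quantity most TSTR studies report; values near 1.0 indicate the synthetic data preserves task-relevant structure.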
2. Use Cases and Empirical Findings Across Modalities
TSTR has been employed in diverse research contexts:
- Object Detection and Segmentation: Synthetic datasets such as GTAV or procedurally generated CAD scenes allow training detectors (YOLOv3, Mask R-CNN, Faster R-CNN, Swin Transformer) and evaluating generalization to real benchmarks like BDD100K, Cityscapes, or WISDOM-Real (Ljungqvist et al., 2023, Ruis et al., 2024, Danielczuk et al., 2018, Shen et al., 2023).
- Satellite/Landcover Segmentation: Mask-conditional GANs with SPADE blocks produce synthetic satellite imagery to train U-Nets for per-pixel landcover, matching or surpassing real-trained models when synthetic and real are mixed (Le et al., 2023).
- Time Series (Medical, Forecasting): RGAN/RCGAN, TimeDiff, and VAE+diffusion models generate synthetic ICU and aviation time series. TSTR shows that state-of-the-art generators, particularly transformer- and diffusion-based ones, can recover 94–97% of real-data utility in flight and ICU prediction tasks, with near-zero AUROC retention gaps (on the order of 0.01) in MIMIC-III/eICU and up to 97% retention in aviation forecasting (Esteban et al., 2017, Ibrahim et al., 22 Oct 2025, Murad et al., 4 Aug 2025).
- Multimodal Relation Extraction: TSTR can be used when only one modality is present at train time. Methods such as MI²RAGE leverage chained cross-modal generation and mutual-information-based filtering to construct synthetic multimodal data, matching or surpassing models trained on real multimodal data (Du et al., 2023).
- Tabular Data: In tabular domains, TSTR (often classification accuracy or AUROC) is the default utility metric for synthetic data in sensitive settings (healthcare, enterprise) (Yu et al., 17 Nov 2025, Murad et al., 4 Aug 2025).
3. Empirical Performance and Critical Phenomena
Observed TSTR performance exposes the domain adaptation gap and the efficacy of different strategies:
| Domain / Task | Real→Real Baseline | TSTR (Synthetic→Real) | Retention/Gain | Reference |
|---|---|---|---|---|
| YOLOv3 on BDD (object det.) | 0.43 mAP | 0.12 mAP | 0.31 | (Ljungqvist et al., 2023) |
| U-Net (sat. landcover, mIoU) | 0.52 | 0.41–0.42 | ≈79–81% of real; mixing: 0.58 | (Le et al., 2023) |
| ICU Mortality (AUROC) | 0.88 | 0.87–0.88 | ≈99% | (Ibrahim et al., 22 Oct 2025) |
| Aviation Forecast (R²) | 0.44 | 0.43 | ≈98% retention | (Murad et al., 4 Aug 2025) |
| ImageNet1K, ResNet-50 (Top-1) | 79.6% | 70.9–76.0% | 89–95% | (Yuan et al., 2023) |
Synthetic data matched or surpassed real-only training with increased synthetic diversity, improved generators, and careful domain translation or augmentation. However, pronounced failures occur when high-level scene statistics, class-conditional cues, or domain-specific textures are not aligned; e.g., domain gap in detector heads (Ljungqvist et al., 2023), mode collapse (Koochali et al., 2022), or overfitting to synthetic artifacts (Hoffmann et al., 2019). Mixing synthetic with a small amount of real data consistently boosts utility (Le et al., 2023, Lee et al., 2024).
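The mixing effect is easy to probe in the same spirit: hold the real budget small, vary how much synthetic data is added, and track the real-test score. A hedged toy sketch follows, with synthetic data modeled as a label-noisier copy of the real distribution; all details are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
w = np.array([1.0, -1.0, 0.5])

def sample(n, label_noise=0.0):
    """Toy data; `label_noise` models imperfect synthetic labels."""
    X = rng.normal(size=(n, 3))
    y = (X @ w > 0).astype(int)
    return X, np.where(rng.random(n) < label_noise, 1 - y, y)

X_real, y_real = sample(50)        # small real training budget
X_te, y_te = sample(2000)          # real test set

scores = {}
for n_syn in (0, 200, 2000):       # grow the synthetic pool
    X_syn, y_syn = sample(n_syn, label_noise=0.15)
    X = np.vstack([X_real, X_syn])
    y = np.concatenate([y_real, y_syn])
    model = LogisticRegression().fit(X, y)
    scores[n_syn] = accuracy_score(y_te, model.predict(X_te))

print(scores)  # real-test accuracy as the synthetic pool grows
```

In this toy setup the small real set anchors the label semantics while the synthetic pool supplies coverage, mirroring the mixing gains reported above.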
4. Diagnostic Analyses and Layerwise Similarity
TSTR provides a practical test of end-to-end generalization, but various papers have supplemented this with more diagnostic metrics:
- CKA (Centered Kernel Alignment): Layerwise CKA similarity between real-trained and synthetic-trained models reveals that low-level features are robust (high CKA in early layers), while mid- and high-level features (detection head, scene semantics) diverge sharply (CKA as low as $0.05$–$0.2$ in the detection head), and this divergence correlates with the mAP drop (Ljungqvist et al., 2023).
- Feature Distance Metrics: Measures such as the train-to-test Mahalanobis distance quantify how well synthetic-augmented training pools cover the real test feature space; lower distances predict higher detection AP (Lee et al., 2024).
- SHAP Attribution Distance: In tabular settings, the SHAP Distance reveals that TSTR accuracy can remain high even when the model’s semantic reasoning patterns diverge from real-trained models, thus TSTR should be viewed as a minimal utility criterion rather than a guarantee of semantic fidelity (Yu et al., 17 Nov 2025).
- Ablation and Sampling Strategies: Adversarial (student–teacher) sampling, mutual-information filtering, and automated curriculum methods improve TSTR by focusing generation or training on hard or informative synthetic samples, mitigating overfitting and domain collapse (Du et al., 2023, Hoffmann et al., 2019, Kerim et al., 2024).
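Linear CKA, the layerwise diagnostic mentioned above, takes only a few lines to compute. A minimal implementation follows; the feature matrices here are random stand-ins for layer activations of real- and synthetic-trained models:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices of
    shape (n_samples, n_features); 1.0 means the representations match up
    to orthogonal transforms and isotropic scaling."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 32))                   # e.g. early-layer features
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))   # random orthogonal matrix
B = A @ Q                                        # same features, rotated basis
C = rng.normal(size=(500, 32))                   # unrelated features

print(round(linear_cka(A, B), 3))  # 1.0: CKA is invariant to basis rotation
print(round(linear_cka(A, C), 3))  # near 0: representations diverge
```

Comparing `linear_cka` per layer between the two models reproduces the qualitative pattern reported above: high alignment early, collapse in task-specific layers.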
5. Methodological Innovations and Recommendations
Multiple strategies have improved TSTR outcomes:
- Synthetic Generation Strategies:
- Conditional mask-driven GANs (SPADE) and domain-randomized renderings for spatial/semantic alignment (Le et al., 2023, Tang et al., 2023, Danielczuk et al., 2018).
- Transformer-based and diffusion-based tabular, image, and time-series generators to maximize coverage of real-world joint distributions (Murad et al., 4 Aug 2025, Ibrahim et al., 22 Oct 2025, Yuan et al., 2023).
- Training Recipe:
- Pretrain backbones on real data and freeze early feature statistics when training on synthetic images (Ruis et al., 2024).
- Use strong data augmentation (MixUp, Mosaic, large-scale jitter, color transforms) to avoid overfitting to synthetic artifacts and to bridge photorealism deficiencies (Ruis et al., 2024, Shen et al., 2023).
- Fine-tune mid-to-late network layers and detection heads; early convolutional features are less domain-specific and may be frozen (Ljungqvist et al., 2023).
- Evaluation and Deployment:
- Always benchmark TSTR alongside TRTR and, where meaningful, train-on-real, test-on-synthetic (TRTS) for dataset auditing (Ibrahim et al., 22 Oct 2025, Koochali et al., 2022).
- Apply attribution-based metrics or layerwise CKA to prevent falsely high TSTR readings that mask semantic failure (Yu et al., 17 Nov 2025, Ljungqvist et al., 2023).
- For small data settings, mixing modest amounts of real, even from cross-domain, with synthetic can vastly increase effective training set size (Lee et al., 2024, Le et al., 2023).
- Subgroup and Privacy Analysis:
- Synthetic data can support granular (subgroup-level) model evaluation, often matching or outperforming small real test subsets in statistical reliability, enabling privacy-preserving benchmarking (Ibrahim et al., 22 Oct 2025).
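The attribution-based checks recommended above can be approximated cheaply. The sketch below is a hedged stand-in for the SHAP-distance idea, comparing permutation-importance vectors of a real-trained and a synthetic-trained model rather than the cited papers' exact metric; the "wrong cue" generator is a contrived illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

def data(n, w):
    """Labels depend on whichever features `w` weights."""
    X = rng.normal(size=(n, 4))
    return X, (X @ w > 0).astype(int)

w_real = np.array([1.0, 1.0, 0.0, 0.0])
w_syn = np.array([0.0, 1.0, 1.0, 0.0])   # generator encodes a wrong cue

X_r, y_r = data(1000, w_real)            # real training data
X_s, y_s = data(1000, w_syn)             # synthetic training data
X_te, y_te = data(500, w_real)           # real test set

def importances(model, X, y):
    """Fit, then measure per-feature importance on the real test set."""
    model.fit(X, y)
    r = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
    return r.importances_mean

imp_real = importances(RandomForestClassifier(random_state=0), X_r, y_r)
imp_syn = importances(RandomForestClassifier(random_state=0), X_s, y_s)

# A large distance flags diverging reasoning even when accuracy looks fine.
print("attribution distance:", round(float(np.linalg.norm(imp_real - imp_syn)), 3))
```

Here the synthetic-trained model leans on a feature the real task ignores, so the attribution distance is large; this is exactly the failure mode a raw TSTR score can mask.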
6. TSTR Limitations, Analytical Insights, and Future Work
TSTR is a powerful, practical measure but not a comprehensive guarantee:
- Blind Spots: TSTR will not reveal if synthetic data reproduces incorrect label reasoning, has semantic gaps, or over-smooths rare/corner cases; attribution-based checks or task-specific diagnostics are required (Yu et al., 17 Nov 2025, Koochali et al., 2022).
- Task Specificity: TSTR utility is tightly linked to the supervised task and may not generalize to new objectives or transfer learning regimes. Precision/recall/F1, calibration, and robustness metrics are essential to supplement mAP or accuracy (Ljungqvist et al., 2023, Yu et al., 17 Nov 2025).
- Assumptions: Many TSTR pipelines assume that $p_{\text{syn}}(y \mid x) \approx p_{\text{real}}(y \mid x)$, i.e., that synthetic and real domains share label semantics. If this assumption fails, the TSTR metric conflates distributional and annotation artifacts.
- Generator Quality is Pivotal: The best performing frameworks for TSTR use deep distribution-matching objectives (MMD, guidance, attribute-driven prompting, mutual information selection) and, when provided with sufficient diversity and scale, outperform GANs or Copula models (Yuan et al., 2023, Ibrahim et al., 22 Oct 2025, Kerim et al., 2024, Murad et al., 4 Aug 2025). Scaling up synthetic set cardinality partially substitutes for narrower synthetic fidelity, with observed linear-to-saturating gains (Yuan et al., 2023).
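Of the distribution-matching objectives listed, MMD is the most compact to illustrate. A minimal RBF-kernel estimator follows; the toy Gaussians stand in for real and generated samples, and the bandwidth `sigma` is a free choice:

```python
import numpy as np

def rbf_mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared Maximum Mean Discrepancy with an RBF
    kernel; approaches 0 when X and Y come from the same distribution."""
    def k(A, B):
        d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d / (2.0 * sigma**2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
good_syn = rng.normal(size=(200, 4))           # well-matched generator
bad_syn = rng.normal(loc=1.0, size=(200, 4))   # mean-shifted generator

print(f"MMD^2 (good generator): {rbf_mmd2(real, good_syn):.4f}")
print(f"MMD^2 (bad generator):  {rbf_mmd2(real, bad_syn):.4f}")
```

Used as a training penalty (as in the MMD-guided generators cited above) or as an audit statistic, a lower MMD against real data generally tracks higher TSTR retention.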
Future directions include curriculum-based or online TSTR, active domain adaptation during deployment, further extension to more complex or multimodal tasks, automated generator evaluation and selection, and formal privacy overlays for sensitive domains (Ibrahim et al., 22 Oct 2025, Du et al., 2023, Kerim et al., 2024).
7. Summary Table: TSTR Evaluation Landscape
| Study / Domain | Generator/Methodology | TSTR Metric/Result | Recommendations / Limitations |
|---|---|---|---|
| Object Detection (Ljungqvist et al., 2023) | YOLOv3 on GTAV/BDD | mAP, 0.12 (vs 0.43 real) | Fine-tune middle layers, diversify cues |
| Satellite Segmentation (Le et al., 2023) | Mask-cond. SPADE-GAN | mIoU, 0.41–0.42 | Mix 50-50 real/synth for best mIoU |
| Tabular, Aviation (Murad et al., 4 Aug 2025) | REaLTabFormer, TabSyn | utility retention: 94–97% | Use transformer autoregressive models |
| ICU Time Series (Ibrahim et al., 22 Oct 2025) | Enhanced TimeAutoDiff | AUROC 0.87–0.88 (≈ real) | Add MMD/consistency penalties |
| ImageNet Classification (Yuan et al., 2023) | Diffusion+MMD+CLIP | Top-1: 70.9–76% | Scale synthetic data, combine w/ guidance |
| Multimodal Extraction (Du et al., 2023) | MI²RAGE, CCG+MI Filter | F1: 92.8 (surpasses real) | Diversity and high MI essential |
| 2D Pose (Hoffmann et al., 2019) | Synthetic Humans+Augmentation | mAP: 13.4 (synthetic) | Teacher-student, focus on hard samples |
TSTR is now a foundational evaluation for learning with synthetic data, offering actionable diagnosis of the synthetic–real gap and providing clear guidance for synthetic data generation and task-specific transfer. Best practices mandate pairing TSTR with diversity/attribution-based diagnostics and optimizing both distributional match and label fidelity for target deployments.