TSTR: Train on Synthetic, Test on Real

Updated 19 January 2026

TSTR is a paradigm that trains machine learning models on synthetic data and evaluates them on real-world datasets to assess domain transfer and generalization.
It involves generating data via simulations, GANs, or procedural methods and assessing performance using key metrics like accuracy, AUROC, and regression losses.
Recent advances integrate hybrid training, diversity regularizers, and attribution-based diagnostics to improve robustness and align synthetic data with real-world characteristics.

The Train on Synthetic – Test on Real (TSTR) paradigm is a foundational experimental protocol for assessing the real-world utility of machine learning models trained on data not directly drawn from the target deployment domain, but instead generated via simulation, procedural generation, GANs, or other synthetic data engines. In TSTR, a model is trained exclusively (or predominantly) on synthetic data and subsequently evaluated on a genuine, independently sourced real-world test set. The approach operationalizes questions of domain transfer, data-driven generalization, and practical value of synthetic datasets in supervised and semi-supervised learning pipelines, and it is widely used in computer vision, natural language, time series, and structured/tabular prediction research.

1. Formal Definition and Protocols

A canonical TSTR pipeline involves the following components:

Let $S_{\text{syn}}$ denote a synthetic training set, $S_{\text{real}}$ a real hold-out test set, and $f_{\text{syn}}$ a supervised model trained on $S_{\text{syn}}$ .
The primary metric is downstream accuracy, measured as

$\text{TSTR}(f_{\text{syn}}, S_{\text{real}}) = \frac{1}{|S_{\text{real}}|} \sum_{(x, y) \in S_{\text{real}}} \mathbf{1}(f_{\text{syn}}(x) = y)$

or, for regression, an appropriate error or goodness-of-fit metric (e.g., $\mathrm{RMSE}$, $R^2$).

For time series, $f_{\text{syn}}$ may be a classifier or regressor trained on synthetic trajectories, evaluated on real sequences for standard metrics: classification accuracy, AUROC, AUPRC, etc. (Esteban et al., 2017, Koochali et al., 2022, Yu et al., 17 Nov 2025).
In tabular and operational settings, the TSTR score may be normalized (e.g., as $U_{\mathrm{model}}$ ) by the corresponding TRTR (train/test on real) score to yield a percentage utility metric (Murad et al., 4 Aug 2025).
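The normalization mentioned above can be illustrated with a short sketch. It assumes a higher-is-better score and the simplest reading of the normalization, namely the TSTR score divided by the TRTR score and expressed as a percentage; the function name `utility_percent` and the example scores are hypothetical and not taken from the cited work.

```python
def utility_percent(tstr_score: float, trtr_score: float) -> float:
    """Express a TSTR score as a percentage of the matching TRTR score.

    Assumes a higher-is-better metric (e.g., accuracy or R^2).
    """
    if trtr_score == 0:
        raise ValueError("TRTR score must be non-zero to normalize against it")
    return 100.0 * tstr_score / trtr_score


# Hypothetical scores, chosen only to illustrate the arithmetic.
example_utility = utility_percent(0.75, 0.80)
print(f"utility = {example_utility:.2f}% of real-data performance")
```

For error-type metrics such as RMSE, where lower is better, the ratio would need to be inverted or otherwise adapted.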

The typical workflow includes:

Fit a generator to available real data (or use a parametric/simulation model) to produce $S_{\text{syn}}$ .
Train a model solely on $S_{\text{syn}}$; select hyperparameters via cross-validation on synthetic validation data or, if permitted, on a small real held-out set.
Evaluate on an untouched real-world hold-out set $S_{\text{real}}$.
Compare performance to models trained on real data only (TRTR), or on other synthetic data protocols.
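The workflow above can be sketched end to end on toy data. This is a minimal illustration under stated assumptions, not a reference implementation: the "real" data are two simulated Gaussian classes, the generator is a per-class Gaussian fit, and the downstream model is a nearest-centroid classifier. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_real(n):
    """Toy stand-in for real data: two Gaussian classes in 2-D (an assumption)."""
    y = rng.integers(0, 2, size=n)
    centers = np.array([[0.0, 0.0], [3.0, 3.0]])
    return centers[y] + rng.normal(size=(n, 2)), y


def fit_generator(x, y):
    """Fit a per-class Gaussian as a deliberately simple generator."""
    return {int(c): (x[y == c].mean(axis=0), np.cov(x[y == c], rowvar=False))
            for c in np.unique(y)}


def sample_synthetic(gen, n_per_class):
    """Draw labeled synthetic samples from the fitted per-class Gaussians."""
    xs = [rng.multivariate_normal(mu, cov, size=n_per_class)
          for mu, cov in gen.values()]
    ys = [np.full(n_per_class, c) for c in gen]
    return np.vstack(xs), np.concatenate(ys)


def fit_centroids(x, y):
    """Nearest-centroid classifier, used for both the TSTR and TRTR models."""
    classes = np.unique(y)
    return classes, np.stack([x[y == c].mean(axis=0) for c in classes])


def accuracy(model, x, y):
    """Fraction of samples whose nearest centroid carries the true label."""
    classes, centroids = model
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return float((classes[dists.argmin(axis=1)] == y).mean())


x_train, y_train = make_real(500)  # real data available for fitting the generator
x_test, y_test = make_real(500)    # untouched real hold-out

x_syn, y_syn = sample_synthetic(fit_generator(x_train, y_train), 500)
tstr_acc = accuracy(fit_centroids(x_syn, y_syn), x_test, y_test)      # train on synthetic
trtr_acc = accuracy(fit_centroids(x_train, y_train), x_test, y_test)  # real-only baseline
print(f"TSTR={tstr_acc:.3f}  TRTR={trtr_acc:.3f}")
```

Because the toy generator matches the toy data family exactly, the two scores should be close here; with a real generator the difference between them is the quantity of interest.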

2. Metrics and Diagnostic Tools

Beyond raw TSTR accuracy or regression metrics, several advanced metrics diagnose domain gap, representational mismatch, and semantic coverage.

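train2test distance ($d_{\text{t2t}}$): Mahalanobis distance in feature space between a test instance's features and the mean and covariance of the training-set features,

$d_{\text{t2t}}(x) = \sqrt{(\phi(x) - \mu_{\text{train}})^{\top} \Sigma_{\text{train}}^{-1} (\phi(x) - \mu_{\text{train}})}$

with $\phi$ a penultimate-layer feature extractor and $\mu_{\text{train}}$, $\Sigma_{\text{train}}$ the mean and covariance of the training-set features; it serves as a proxy for the domain gap (Lee et al., 2024).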
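A Mahalanobis-style train2test distance can be sketched in a few lines of NumPy. The sketch assumes that features have already been extracted (for example, from a penultimate layer) and are passed in as arrays; the function name, the small ridge term `eps`, and the random stand-in features are assumptions made for illustration.

```python
import numpy as np


def train2test_distance(train_feats, test_feats, eps=1e-6):
    """Mahalanobis distance of each test feature vector to the training features.

    train_feats: array of shape (n_train, d); test_feats: array of shape (n_test, d).
    eps adds a small ridge so the covariance is invertible (an assumption).
    """
    mu = train_feats.mean(axis=0)
    cov = np.atleast_2d(np.cov(train_feats, rowvar=False))
    cov_inv = np.linalg.inv(cov + eps * np.eye(cov.shape[0]))
    diff = test_feats - mu
    # Row-wise quadratic form diff_i^T cov_inv diff_i.
    sq = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(sq, 0.0))


# Random stand-in features; a real pipeline would use network activations.
demo_rng = np.random.default_rng(0)
train_feats = demo_rng.normal(size=(200, 4))
near = train_feats.mean(axis=0, keepdims=True)  # sits at the training mean
far = near + 10.0                               # shifted far from the training data
print(train2test_distance(train_feats, np.vstack([near, far])))
```

Larger values indicate test instances that lie further from the training feature distribution.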

AP$_{\text{t2t}}$ (Distance-based Average Precision): Precision-recall computed with respect to train2test distance rather than model confidence; a high AP$_{\text{t2t}}$ indicates that true positives lie close to the training distribution (Lee et al., 2024).
SHAP Distance: Cosine distance between global SHAP attributions for models trained on synthetic vs. real data, quantifying misalignment in feature semantics regardless of predictive performance (Yu et al., 17 Nov 2025).
Relative TSTR (rel(TSTR)): The drop in TSTR score compared to a real-trained (TRTR) baseline (Koochali et al., 2022).
Subgroup/AUC Error: For population heterogeneity, a subgroup-level AUROC error tracks the AUROC differences between models trained on synthetic vs. real data across all test subgroups (Ibrahim et al., 22 Oct 2025).
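Two of the diagnostics listed above lend themselves to a compact sketch: the cosine distance between attribution vectors and a per-subgroup AUROC gap. The code below assumes the global attribution vectors (for example, mean absolute SHAP values per feature) and the model scores have already been computed elsewhere; it does not call any attribution library. The function names and the decision to report one gap per subgroup are assumptions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two attribution vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("attribution vectors must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def subgroup_auroc_gaps(labels, groups, scores_syn, scores_real):
    """Absolute AUROC difference (synthetic-trained vs. real-trained) per subgroup."""
    gaps = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        y = [labels[i] for i in idx]
        auc_syn = auroc(y, [scores_syn[i] for i in idx])
        auc_real = auroc(y, [scores_real[i] for i in idx])
        gaps[g] = abs(auc_syn - auc_real)
    return gaps


# Hypothetical attribution vectors for a three-feature model.
print(cosine_distance([0.6, 0.3, 0.1], [0.1, 0.3, 0.6]))
```

How the per-subgroup gaps are aggregated (maximum, mean, or reported individually) is left open here.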

3. Methodological Advances and Practical Guidelines

Synthetic Data Selection and Augmentation

Progressive Transformation Learning (PTL): Iteratively select synthetic samples whose features are closest (in Mahalanobis sense) to the expanded training set; transform them via syn2real operations and incorporate for better domain coverage (Lee et al., 2024).
Diversity Regularizers: Directly encourage generative models to maximize output variance subject to class or mask constraints (e.g., the SPADE+DSGAN diversity term used for satellite imagery (Le et al., 2023)).
Multi-Armed Bandit Selection: Dynamically select the most beneficial subset of synthetic data by ranking either by photorealism/diversity scores or feature cohesion, using UCB-based reward tracking (Kerim et al., 2024).
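The bandit-style selection described above can be illustrated with the standard UCB1 rule. This is a generic sketch rather than the cited method: each arm stands for a candidate synthetic subset, and the reward is a simulated usability score in [0, 1]. The names `ucb1`, `simulated_reward`, and `TRUE_GAIN` are hypothetical, and in practice the reward would come from a real scoring step.

```python
import math
import random


def ucb1(reward_fn, n_arms, n_rounds):
    """Run UCB1 and return (pull counts, mean reward) per arm."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once before using the confidence bound
        else:
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = reward_fn(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    return counts, means


# Simulated rewards: subset 2 is the most useful on average (an assumption).
_noise = random.Random(0)
TRUE_GAIN = [0.2, 0.5, 0.8]


def simulated_reward(arm):
    """Noisy usability score clipped to [0, 1]."""
    return min(1.0, max(0.0, TRUE_GAIN[arm] + _noise.gauss(0.0, 0.1)))


counts, means = ucb1(simulated_reward, n_arms=3, n_rounds=2000)
best_subset = max(range(3), key=lambda a: counts[a])
print("pulls per subset:", counts, "most selected:", best_subset)
```

The exploration bonus shrinks as an arm is pulled more often, so sampling concentrates on the subset with the highest observed reward while the others are still revisited occasionally.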

Model Architecture and Data Mixing

Transformer-based Backbones: Shape-prior architectures such as Swin Transformer, when strongly augmented, can nearly close the real-vs-synthetic detection performance gap in scenarios such as object and vehicle recognition (Ruis et al., 2024).
Hybrid Training: Supplement synthetic data with a budget of cross-domain real images; even 20–200 real instances can result in dramatic AP gains, with diminishing improvements past 100–200 (Lee et al., 2024).
Adversarial Student-Teacher Sampling: Use a teacher to mine the hardest synthetic examples for the student model, dynamically targeting feature-space regions poorly covered by current training data (Hoffmann et al., 2019).
Domain-Specific Input Transformations: Linear operations such as cross-correlation and convolution by autocorrelations (MLReal) are effective for waveform and geophysical data to reduce domain divergence (Alkhalifah et al., 2021).

4. Empirical Performance and Limitations

Bounding TSTR Gaps: High-quality latent-diffusion or Transformer-based data generators (e.g., Enhanced TimeAutoDiff, REaLTabFormer) consistently achieve AUROC close to real-trained baselines or 94–97% of real-data regression performance, effectively saturating downstream utility for complex tabular targets (Ibrahim et al., 22 Oct 2025, Murad et al., 4 Aug 2025).
Scaling Behavior: TSTR accuracy systematically improves with increasing synthetic set size, up to a plateau determined by generator fidelity and domain coverage. For ImageNet-1K, 10× synthetic data comes within 3 percentage points of real-only accuracy; similar scaling is observed across tabular and time-series domains (Yuan et al., 2023, Murad et al., 4 Aug 2025).
Modality and Task Variance: The TSTR gap varies significantly across domains:
- In object detection, cross-domain synthetic-to-real AP improves only with diversity of viewpoints and backgrounds; performance stagnates if synthetic data lack critical domain clutter (e.g., background in HERIDAL) (Lee et al., 2024).
- In multimodal relation extraction, MI2RAGE demonstrates that diversity augmentation and mutual-information-based filtering of synthetic data can enable TSTR models to surpass real-trained SOTA on real test sets (Du et al., 2023).
- TSTR on time-series (medical) data with state-of-the-art generative models results in only minor (≤1.6% accuracy; 5–12% AUROC/AUPRC) degradation compared to real-only baselines (Esteban et al., 2017, Koochali et al., 2022, Ibrahim et al., 22 Oct 2025).

Task/Domain	Real-Only Perf.	TSTR Perf. (Top)/Notes	Reference
Human detection	18–22%	26–48% with 20–200 real images added	(Lee et al., 2024)
Tabular regression	up to 0.44	94–97% of real utility (REaLTabFormer)	(Murad et al., 4 Aug 2025)
Object detection	mAP@50 79–95%	Swin-T/S achieves up to 95%	(Ruis et al., 2024)
Satellite segment.	mIoU=0.52	mIoU=0.40 (synth), 0.58 (50–50 mix)	(Le et al., 2023)
Time-series (AUROC)	0.96–0.99	within roughly 0.01 of real-only	(Ibrahim et al., 22 Oct 2025)

Spurious Correlation Blindness: TSTR alone does not guarantee semantic fidelity; models can exploit correlations present in both synthetic and test sets but not causal in the target domain. This is evident in tabular settings, where models may overemphasize specific features if generator artifacts align with label distributions (Yu et al., 17 Nov 2025).
Mode Collapse/Drop Detection: TSTR sharply detects generator failures that lose class modes or collapse diversity (rel(TSTR) grows as more modes drop/collapse), unlike TRTS or FID (Koochali et al., 2022).
Semantic Attributive Gaps: Attribution-based metrics (e.g., SHAP Distance) are required alongside TSTR to audit feature importance and decision-making alignment; TSTR can remain high even when feature importances diverge significantly from those learned on real data (Yu et al., 17 Nov 2025).
Head/Zonal Adaptation Limits: In object detectors, representational similarity (by CKA) reveals that the largest synthetic–real domain gap localizes to specialized head/later layers, indicating global feature-based domain alignment is not sufficient; most synthetic–real transfer must target these "head" blocks (Ljungqvist et al., 2023).
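The sensitivity of TSTR to dropped modes, noted in the list above, can be reproduced on a tiny hand-built example. The sketch assumes one-dimensional features, three well-separated classes, and a nearest-centroid classifier; all data values are made up for illustration.

```python
def fit_centroids(xs, ys):
    """Per-class mean of 1-D features."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Label of the closest class centroid."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def accuracy(centroids, xs, ys):
    return sum(predict(centroids, x) == y for x, y in zip(xs, ys)) / len(xs)


# Real test set: three classes with four points each.
real_x = [0.0, 0.2, -0.1, 0.1, 5.0, 5.1, 4.9, 5.2, 10.0, 10.1, 9.8, 10.2]
real_y = [0] * 4 + [1] * 4 + [2] * 4

# Synthetic sets: one covers every class, one has lost class 2 entirely.
full_syn = ([0.05, -0.05, 5.05, 4.95, 10.05, 9.95], [0, 0, 1, 1, 2, 2])
dropped_syn = ([0.05, -0.05, 5.05, 4.95], [0, 0, 1, 1])

acc_full = accuracy(fit_centroids(*full_syn), real_x, real_y)
acc_drop = accuracy(fit_centroids(*dropped_syn), real_x, real_y)
print(f"TSTR with all modes: {acc_full:.3f}, with one mode dropped: {acc_drop:.3f}")
```

A model trained on the mode-dropped set can never predict the missing class, so every real test point from that class is misclassified and the TSTR score falls accordingly.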

6. Practical Recommendations and Design Patterns

Combine synthetic data with a small number of real cross-domain images, emphasizing feature-space diversity to minimize train2test distance and maximize AP$_{\text{t2t}}$ (Lee et al., 2024).
Deploy strong augmentations (MixUp, RandAugment, large-scale jitter) and shape-biased architectures (Transformers) to prevent overfitting to synthetic artifacts and to enhance transferability, especially for large capacity models (Ruis et al., 2024).
Monitor not only TSTR accuracy but also semantic attribution distances (SHAP Distance), especially in regulated or safety-critical domains to avoid failure by "shortcut learning" (Yu et al., 17 Nov 2025).
For class-imbalanced or subgroup-sensitive applications, adopt decoupled training (e.g., From Fake to Real—FFR) to prevent confounded subgroup-synthetic bias and maximize worst-group accuracy (Qraitem et al., 2023).
Assess generator quality for coverage and diversity: in time-series and tabular settings, pair TSTR with metrics for feature-space coverage (FITD, MMD) and discriminatory attribution to flag latent model distortions (Koochali et al., 2022, Yu et al., 17 Nov 2025).
In simulation-based domains, use explicit scene randomization, photorealistic style transfer, and anti-curriculum teacher sampling to target rare or hard synthetic samples for enhanced robust TSTR performance (Hoffmann et al., 2019, Tang et al., 2023).
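Among the coverage metrics recommended above, MMD is straightforward to sketch. The code below computes a biased (V-statistic) estimate of squared MMD with an RBF kernel between a real and a synthetic sample. The kernel choice, the bandwidth parameter `gamma`, and the random stand-in data are assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """Pairwise RBF kernel exp(-gamma * ||a_i - b_j||^2)."""
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def mmd2(x, y, gamma=1.0):
    """Biased estimate of squared MMD between samples x and y (rows are points)."""
    return float(
        rbf_kernel(x, x, gamma).mean()
        + rbf_kernel(y, y, gamma).mean()
        - 2.0 * rbf_kernel(x, y, gamma).mean()
    )


demo_rng = np.random.default_rng(0)
real = demo_rng.normal(size=(50, 2))
good_syn = demo_rng.normal(size=(50, 2))        # same distribution as `real`
bad_syn = demo_rng.normal(size=(50, 2)) + 3.0   # shifted away from `real`
print(f"MMD^2 good={mmd2(real, good_syn):.4f}  bad={mmd2(real, bad_syn):.4f}")
```

A value near zero suggests the two samples are hard to distinguish under the chosen kernel; the result depends on `gamma`, which is often set from the median pairwise distance.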

7. Outlook and Ongoing Research Directions

Research continues to address open challenges in TSTR, including the development of generative pipelines that can:

Systematically narrow the residual TSTR gap by explicitly matching not only marginal and class-conditional distributions but also task-driven “semantic” properties such as feature importances, local dependencies, and subgroup fairness (Yuan et al., 2023, Yu et al., 17 Nov 2025).
Adaptively generate synthetic data in a feedback loop guided by downstream performance metrics, attribution gaps, or dynamic usability scores, as in UCB-based or mutual-information–maximization frameworks (Kerim et al., 2024, Du et al., 2023).
Extend robustness to more abstract or multimodal settings, where chained cross-domain generation and annotator-in-the-loop filtering can outperform both synthetic and real data baselines for complex tasks such as multimodal relation extraction (Du et al., 2023).

The rigorous deployment of TSTR as a utility, fidelity, and robustness benchmark remains critical for establishing the credibility and limitations of synthetic data in advancing data-driven machine learning in domains with high annotation, privacy, or sampling barriers.