When and how synthetic data improve generalization and transfer

Characterize the regimes in which synthetic data augmentation improves out‑of‑distribution generalization and transfer learning performance, including identifying beneficial types of distributional shift, quantifying the impact of generative‑model estimation error, and developing diagnostics for harmful extrapolation.

Background

Synthetic data are increasingly used to improve out‑of‑distribution generalization by covering underrepresented regions or simulating shifts, but theoretical understanding of when this helps remains limited.

The source paper (Abdel-Azim et al., 2026) explicitly calls for characterizing the conditions under which synthetic augmentation aids generalization and transfer, emphasizing three aspects: the type of distributional shift, the estimation error of the generative model, and diagnostics to detect harmful extrapolation.
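
To make the notion of a diagnostic for harmful extrapolation concrete, the following is a minimal sketch of one naive heuristic: flag a synthetic point when it lies farther from the real data than real points typically lie from each other. This is not a method from the paper. The function names, the choice of k, and the quantile threshold are all illustrative assumptions, and a distance-based check like this says nothing about whether a flagged point actually hurts downstream performance.

```python
import numpy as np


def kth_nn_distance(queries, reference, k, exclude_self=False):
    """Euclidean distance from each query row to its k-th nearest reference row.

    Set exclude_self=True when queries and reference are the same array, so
    that each point's zero distance to itself is skipped.
    """
    diffs = queries[:, None, :] - reference[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))  # shape (n_queries, n_reference)
    dists.sort(axis=1)
    idx = k if exclude_self else k - 1
    return dists[:, idx]


def flag_extrapolation(real, synthetic, k=5, quantile=0.95):
    """Flag synthetic rows that sit unusually far from the real sample.

    Hypothetical heuristic (an assumption of this sketch): the threshold is
    the given quantile of k-th nearest-neighbor distances among the real
    points themselves. Requires more than k real points.
    """
    baseline = kth_nn_distance(real, real, k, exclude_self=True)
    threshold = np.quantile(baseline, quantile)
    synth_dist = kth_nn_distance(synthetic, real, k)
    return synth_dist > threshold, threshold
```

Calling `flag_extrapolation(real, synthetic)` with two arrays of shape (n, d) returns a boolean mask over the synthetic rows and the threshold used. The pairwise distance matrix is built in memory, so the sketch only suits small samples. In high dimensions raw Euclidean distances become less informative, so a learned embedding or a density-ratio estimate would likely be a more reasonable basis for such a check.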

References

A central open problem is therefore to characterize when and how synthetic data improve generalization ability and transferability. This includes, but is not limited to, identifying the types of distributional shifts for which synthetic augmentation is beneficial, understanding the role of the estimation error of the generative model, and developing diagnostics to detect harmful extrapolation.

Harnessing Synthetic Data from Generative AI for Statistical Inference (2603.05396 - Abdel-Azim et al., 5 Mar 2026) in Section 4, Extrapolation, Generalization, and Transfer