
Combining propensity score methods with variational autoencoders for generating synthetic data in presence of latent sub-groups (2312.07781v1)

Published 12 Dec 2023 in cs.LG

Abstract: In settings requiring synthetic data generation based on a clinical cohort, e.g., due to data protection regulations, heterogeneity across individuals might be a nuisance that we need to control or faithfully preserve. The sources of such heterogeneity might be known, e.g., as indicated by sub-group labels, or might be unknown and thus reflected only in properties of distributions, such as bimodality or skewness. We investigate how such heterogeneity can be preserved and controlled when obtaining synthetic data from variational autoencoders (VAEs), i.e., a generative deep learning technique that utilizes a low-dimensional latent representation. To faithfully reproduce unknown heterogeneity reflected in marginal distributions, we propose to combine VAEs with pre-transformations. For dealing with known heterogeneity due to sub-groups, we complement VAEs with models for group membership, specifically from propensity score regression. The evaluation is performed with a realistic simulation design that features sub-groups and challenging marginal distributions. The proposed approach faithfully recovers the latter, compared to synthetic data approaches that focus purely on marginal distributions. Propensity scores add complementary information, e.g., when visualized in the latent space, and enable sampling of synthetic data with or without sub-group specific characteristics. We also illustrate the proposed approach with real data from an international stroke trial that exhibits considerable distribution differences between study sites, in addition to bimodality. These results indicate that describing heterogeneity by statistical approaches, such as propensity score regression, might be more generally useful for complementing generative deep learning for obtaining synthetic data that faithfully reflects structure from clinical cohorts.
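As a rough illustration of the group-membership model mentioned in the abstract, the sketch below fits a propensity score regression (a logistic regression for sub-group membership given covariates) on a toy cohort. It is not the authors' implementation: the simulated data, the function names `fit_propensity` and `propensity`, and the learning-rate and epoch settings are all assumptions made for illustration, and the VAE and the pre-transformations are deliberately left out.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_propensity(X, g, lr=0.1, epochs=500):
    """Fit P(g = 1 | x) by batch gradient descent on the logistic loss.

    X: list of covariate vectors, g: list of 0/1 sub-group labels.
    Hypothetical helper for illustration only.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0.0
        for x, y in zip(X, g):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def propensity(x, w, b):
    """Estimated probability that an individual belongs to sub-group 1."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Toy cohort (assumed, not from the paper): two covariates, where the
# first covariate is shifted between the two sub-groups.
rng = random.Random(0)
X, g = [], []
for _ in range(200):
    label = 1 if rng.random() < 0.5 else 0
    shift = 1.5 if label else -1.5
    X.append([rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0)])
    g.append(label)

w, b = fit_propensity(X, g)
scores = [propensity(x, w, b) for x in X]

# Average propensity score within each sub-group.
mean_g1 = sum(s for s, y in zip(scores, g) if y == 1) / sum(g)
mean_g0 = sum(s for s, y in zip(scores, g) if y == 0) / (len(g) - sum(g))
print(f"mean score, sub-group 1: {mean_g1:.3f}")
print(f"mean score, sub-group 0: {mean_g0:.3f}")
```

In a setup like the one the abstract describes, such scores could be attached to each individual's latent representation, for example to colour a latent-space plot or to select which latent points to sample from; how exactly the paper does this is not specified in the text above.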

Authors
  1. Kiana Farhadyar
  2. Federico Bonofiglio
  3. Maren Hackenberg
  4. Daniela Zoeller
  5. Harald Binder