Synthpop++: A Hybrid Framework for Generating A Country-scale Synthetic Population (2304.12284v2)
Abstract: Population censuses are vital to public policy decision-making. They provide insight into human resources, demography, culture, and economic structure at local, regional, and national levels. However, such surveys are very expensive (especially for low- and middle-income countries with large populations, such as India), time-consuming, and may raise privacy concerns, depending on the kinds of data collected. In light of these issues, we introduce SynthPop++, a novel hybrid framework that combines data from multiple real-world surveys (with different, partially overlapping sets of attributes) to produce a real-scale synthetic population of humans. Critically, our population maintains family structures comprising individuals with demographic, socioeconomic, health, and geolocation attributes: our "fake" people live in realistic locations, have realistic families, and so on. Such data can be used for a variety of purposes; we explore one such use case, agent-based modelling of infectious disease in India. To gauge the quality of our synthetic population, we use both machine learning and statistical metrics. Our experimental results show that our synthetic population can realistically simulate the population for various administrative units of India, producing real-scale, detailed data at the desired level of zoom, from cities to districts to states, eventually combining to form a country-scale synthetic population.