Synthetic Data in AI: Challenges, Applications, and Ethical Implications (2401.01629v1)

Published 3 Jan 2024 in cs.LG, cs.AI, and cs.CY

Abstract: In the rapidly evolving field of artificial intelligence, the creation and utilization of synthetic datasets have become increasingly significant. This report delves into the multifaceted aspects of synthetic data, particularly emphasizing the challenges and potential biases these datasets may harbor. It explores the methodologies behind synthetic data generation, spanning traditional statistical models to advanced deep learning techniques, and examines their applications across diverse domains. The report also critically addresses the ethical considerations and legal implications associated with synthetic datasets, highlighting the urgent need for mechanisms to ensure fairness, mitigate biases, and uphold ethical standards in AI development.
