
Balanced Mixed-Type Tabular Data Synthesis with Diffusion Models (2404.08254v2)

Published 12 Apr 2024 in cs.LG

Abstract: Diffusion models have emerged as a robust framework for various generative tasks, including tabular data synthesis. However, current tabular diffusion models tend to inherit bias from the training dataset and generate biased synthetic data, which can lead to discriminatory outcomes downstream. In this research, we introduce a novel tabular diffusion model that incorporates sensitive guidance to generate fair synthetic data with balanced joint distributions of the target label and sensitive attributes, such as sex and race. The empirical results demonstrate that our method effectively mitigates bias in training data while maintaining the quality of the generated samples. Furthermore, we provide evidence that our approach outperforms existing methods for synthesizing tabular data on fairness metrics such as demographic parity ratio and equalized odds ratio, achieving improvements of over $10\%$. Our implementation is available at https://github.com/comp-well-org/fair-tab-diffusion.
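The two fairness metrics named in the abstract can be made concrete with a short sketch. This is an illustrative implementation of the standard definitions (group selection rates for demographic parity; per-group true- and false-positive rates for equalized odds), not the paper's own code; the function names and the ratio-of-min-to-max convention are assumptions here.

```python
import numpy as np

def demographic_parity_ratio(y_pred, group):
    """Ratio of the smallest to the largest group selection rate P(Y_hat=1 | A=a).

    A value of 1.0 means all groups receive positive predictions at the same rate.
    """
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return min(rates) / max(rates)

def equalized_odds_ratio(y_true, y_pred, group):
    """Worst-case min/max group ratio over both the TPR and the FPR.

    Assumes every (group, label) cell is non-empty and at least one group
    has a nonzero rate for each label.
    """
    ratios = []
    for label in (1, 0):  # label=1 gives per-group TPRs, label=0 gives FPRs
        rates = [y_pred[(group == g) & (y_true == label)].mean()
                 for g in np.unique(group)]
        ratios.append(min(rates) / max(rates))
    return min(ratios)
```

Both metrics lie in [0, 1], with values near 1 indicating parity across sensitive groups; the paper reports improvements of over 10% on these metrics relative to existing tabular synthesizers.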

Authors (6)
  1. Zeyu Yang
  2. Peikun Guo
  3. Khadija Zanna
  4. Akane Sano
  5. Han Yu
  6. Xiaoxue Yang