
Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees (2309.09968v3)

Published 18 Sep 2023 in cs.LG

Abstract: Tabular data is hard to acquire and is subject to missing values. This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) tabular data utilizing score-based diffusion and conditional flow matching. In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost, a widely used Gradient-Boosted Tree (GBT) technique. To test our method, we build one of the most extensive benchmarks for tabular data generation and imputation, containing 27 diverse datasets and 9 metrics. Through empirical evaluation across the benchmark, we demonstrate that our approach outperforms deep-learning generation methods in data generation tasks and remains competitive in data imputation. Notably, it can be trained in parallel using CPUs without requiring a GPU. Our Python and R code is available at https://github.com/SamsungSAILMontreal/ForestDiffusion.

Authors (3)
  1. Alexia Jolicoeur-Martineau (22 papers)
  2. Kilian Fatras (18 papers)
  3. Tal Kachman (19 papers)
Citations (20)

Summary

Analyzing the Use of Gradient-Boosted Trees for Tabular Data Generation and Imputation

The paper "Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees" by Alexia Jolicoeur-Martineau, Kilian Fatras, and Tal Kachman addresses a central challenge in machine learning: generating and imputing mixed-type (continuous and categorical) tabular data. The problem is pressing because missing values and small training sets are pervasive in fields such as economics, medicine, and the social sciences.

The researchers depart from standard practice by eschewing deep neural networks for score-function estimation, relying instead on XGBoost, a widely used gradient-boosted tree (GBT) method. This choice is motivated by the well-documented observation that GBTs often outperform neural networks on tabular prediction and classification tasks. On a newly assembled benchmark of 27 diverse datasets and 9 metrics, the proposed approach surpasses deep-learning methods in data generation and remains competitive in data imputation.

Key contributions include the first diffusion and flow models for tabular data generation built on GBTs, training that parallelizes across CPUs without requiring a GPU, and the ability to handle incomplete data directly during training. A public repository with the Python and R implementation supports reproducibility and accessibility within the research community.

Technical Approach

The central technical innovation is the use of XGBoost as the function approximator for the score function and the vector field in diffusion and flow models. Deep neural networks have traditionally filled this role, largely because of their differentiability. The key observation exploited here is that both score matching and conditional flow matching (CFM) reduce to regression problems, so differentiability of the approximator is not actually required: any sufficiently expressive regressor can be trained on the same targets. The authors adapt the CFM and score-based diffusion frameworks accordingly and, because XGBoost learns default split directions for missing values, can train directly on incomplete datasets. By circumventing the differentiability requirement, the paper opens a path toward employing non-differentiable models such as GBTs for generative tasks.
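
To make the regression view concrete, here is a minimal sketch of Forest-Flow-style training and sampling. It is illustrative only, not the authors' implementation: the one-regressor-per-noise-level-per-column layout, the hyperparameters, and the restriction to continuous features are all simplifying assumptions.

```python
import numpy as np
import xgboost as xgb

def train_forest_flow(X, n_t=50, seed=0):
    """Fit one XGBoost regressor per (noise level, column) to the
    flow-matching target u_t = x1 - x0 along the linear path
    x_t = (1 - t) * x0 + t * x1, with x0 ~ N(0, I) and x1 the data."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, d))  # noise endpoints x0
    U = X - Z                        # target vector field (constant along the path)
    models = []
    for t in np.linspace(0.0, 1.0, n_t, endpoint=False):
        Xt = (1.0 - t) * Z + t * X   # interpolated inputs at level t
        # XGBoost handles NaN inputs natively, which is what allows
        # training directly on incomplete data.
        models.append([xgb.XGBRegressor(n_estimators=100, max_depth=7)
                       .fit(Xt, U[:, j]) for j in range(d)])
    return models

def sample_forest_flow(models, n_samples, d, seed=1):
    """Draw samples by Euler integration of dx/dt = u_t(x) from t=0 to t=1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, d))  # start from pure Gaussian noise
    dt = 1.0 / len(models)
    for level in models:
        u = np.column_stack([m.predict(x) for m in level])
        x = x + dt * u
    return x
```

The Euler loop is just the simplest way to integrate the learned vector field; a higher-order ODE solver could be substituted without changing the training step.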

Evaluation and Results

Empirical results show that the approach is competitive with deep-learning-based imputation methods and substantially superior in generation tasks. Across the benchmark datasets, Forest-Flow, the variant that pairs XGBoost with conditional flow matching, excels at generating realistic synthetic data even when the training data contains missing values. The evaluation metrics include Wasserstein distance, coverage, prediction efficiency, and statistical inference validity, covering both the diversity of the generated data and its downstream reliability.
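
As an illustration of one such metric, the hedged sketch below computes the Wasserstein-1 distance between real and synthetic samples with the POT library; the benchmark's exact metric implementations may differ.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

def wasserstein_eval(X_real, X_fake):
    """Exact optimal-transport cost between two empirical distributions."""
    a = np.full(len(X_real), 1.0 / len(X_real))      # uniform weights, real
    b = np.full(len(X_fake), 1.0 / len(X_fake))      # uniform weights, synthetic
    M = ot.dist(X_real, X_fake, metric="euclidean")  # pairwise cost matrix
    return ot.emd2(a, b, M)                          # solve the exact OT problem
```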

Notably, Forest-Flow closely matches TabDDPM, a neural diffusion model, on statistical evaluation measures. The work also marks a shift away from dependence on computationally intensive GPU resources, demonstrating that GBT-based methods scale on conventional CPU clusters.
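
The CPU-parallelism claim follows from the structure of the method: each noise level defines an independent regression problem. The joblib sketch below shows one assumed way to exploit this, not the authors' exact pipeline, reusing the per-level training from the earlier snippet.

```python
import numpy as np
import xgboost as xgb
from joblib import Parallel, delayed

def fit_level(t, X, Z):
    """Fit all per-column regressors for a single noise level t."""
    Xt = (1.0 - t) * Z + t * X
    U = X - Z
    return [xgb.XGBRegressor(n_estimators=100).fit(Xt, U[:, j])
            for j in range(X.shape[1])]

def train_parallel(X, n_t=50, n_jobs=-1, seed=0):
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal(X.shape)
    levels = np.linspace(0.0, 1.0, n_t, endpoint=False)
    # Each noise level is an independent regression problem, so the
    # whole training loop spreads across CPU cores with no GPU involved.
    return Parallel(n_jobs=n_jobs)(delayed(fit_level)(t, X, Z) for t in levels)
```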

Implications and Future Outlook

The work carries both practical and theoretical implications. Practically, GBT-based generative models reduce computational overhead, making advanced generative techniques accessible to institutions and researchers with limited resources. Theoretically, it challenges the prevailing assumption that neural networks are required for generative modeling, highlighting the potential of alternative function approximators.

The paper outlines several directions for future work: employing techniques such as multinomial diffusion to improve performance, refining mini-batch training for GBTs, and extending the method to applications such as data augmentation and domain translation. Using XGBoost's feature importances to understand the mechanisms underlying data generation is another intriguing avenue.

In conclusion, this work broadens the horizons of generative modeling for tabular data and sets the stage for continued research into resource-efficient, practically applicable methods. By requiring neither GPUs nor deep-learning frameworks, it lowers the barrier to training generative models and democratizes access to these techniques.