EPIC: Effective Prompting for Imbalanced-Class Data Synthesis in Tabular Data Classification via Large Language Models (2404.12404v4)

Published 15 Apr 2024 in cs.LG and cs.AI

Abstract: LLMs have demonstrated remarkable in-context learning capabilities across diverse applications. In this work, we explore the effectiveness of LLMs for generating realistic synthetic tabular data, identifying key prompt design elements to optimize performance. We introduce EPIC, a novel approach that leverages balanced, grouped data samples and consistent formatting with unique variable mapping to guide LLMs in generating accurate synthetic data across all classes, even for imbalanced datasets. Evaluations on real-world datasets show that EPIC achieves state-of-the-art machine learning classification performance while significantly improving generation efficiency. These findings highlight the effectiveness of EPIC for synthetic tabular data generation, particularly in addressing class imbalance. Our source code is available at: https://seharanul17.github.io/project-synthetic-tabular-LLM/
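
The abstract names three prompt-design elements: balanced, grouped data samples; consistent formatting; and unique variable mapping. The sketch below shows one way such a prompt could be assembled. It is a minimal illustration, not the paper's exact template: the function name `build_epic_prompt`, the instruction wording, and the `x1, x2, ...` alias scheme are assumptions made here for concreteness.

```python
import random
from collections import defaultdict

def build_epic_prompt(rows, label_key, group_size=2, n_groups=3, seed=0):
    """Assemble an EPIC-style few-shot prompt from real rows (sketch).

    Draws class-balanced groups of examples, renders every row with one
    consistent CSV-like format, and hides raw column names behind unique
    placeholder variables (x1, x2, ...) so the model continues the same
    pattern for every class, including minority ones.
    """
    rng = random.Random(seed)
    feature_keys = [k for k in rows[0] if k != label_key]
    # Unique variable mapping: stable aliases replace raw column names.
    alias = {k: f"x{i + 1}" for i, k in enumerate(feature_keys)}

    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    header = ",".join(alias[k] for k in feature_keys) + ",label"
    # Instruction wording is hypothetical; the paper's prompt may differ.
    lines = ["Continue the table with new, realistic rows.", header, ""]
    for _ in range(n_groups):
        # Balanced, grouped sampling: the same number of rows per class
        # appears in each group, regardless of the dataset's imbalance.
        for cls, pool in by_class.items():
            for row in rng.sample(pool, min(group_size, len(pool))):
                values = [str(row[k]) for k in feature_keys]
                lines.append(",".join(values) + f",{cls}")
        lines.append("")  # blank line separates groups
    return "\n".join(lines)

# Toy usage on an imbalanced two-feature table.
rows = [
    {"age": 35, "income": 52000, "y": 0},
    {"age": 41, "income": 61000, "y": 0},
    {"age": 48, "income": 59000, "y": 0},
    {"age": 29, "income": 33000, "y": 1},
]
print(build_epic_prompt(rows, label_key="y", group_size=1, n_groups=2))
```

On the toy table above, each few-shot group contains one majority-class and one minority-class row, so the in-context examples the model sees are balanced even though the underlying data are not; this balanced grouping is the mechanism the abstract credits for accurate generation across all classes.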
