A Comparison of Machine Learning Methods for Data with High-Cardinality Categorical Variables

Published 5 Jul 2023 in cs.LG, cs.AI, and stat.ML (arXiv:2307.02071v1)

Abstract: High-cardinality categorical variables are variables for which the number of distinct levels is large relative to the sample size of a data set; in other words, there are few data points per level. Machine learning methods can have difficulties with such variables. In this article, we empirically compare several versions of two of the most successful machine learning methods, tree-boosting and deep neural networks, as well as linear mixed effects models, on multiple tabular data sets with high-cardinality categorical variables. We find that, first, machine learning models with random effects have higher prediction accuracy than their classical counterparts without random effects, and, second, tree-boosting with random effects outperforms deep neural networks with random effects.
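
The abstract does not include an implementation, but the core idea it compares, combining a flexible fixed-effects function with random effects for a high-cardinality categorical variable, can be sketched as a model y = F(X) + Zb + e, where F is for example a boosted tree ensemble and b contains random intercepts, one per category level. The Python sketch below is illustrative only: the simulated data, parameter values, and the use of the statsmodels and GPBoost packages are assumptions, not details taken from the paper (the abstract names no specific library).

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n, n_levels = 2000, 500                      # roughly 4 observations per level
group = rng.integers(0, n_levels, size=n)    # high-cardinality categorical variable
X = rng.uniform(size=(n, 2))
b = rng.normal(scale=0.5, size=n_levels)     # true random intercepts, one per level
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + b[group] + rng.normal(scale=0.1, size=n)

# Baseline: linear mixed effects model with a random intercept per level (statsmodels).
import statsmodels.formula.api as smf
df = pd.DataFrame({"y": y, "x1": X[:, 0], "x2": X[:, 1], "group": group})
lmm = smf.mixedlm("y ~ x1 + x2", data=df, groups=df["group"]).fit()
print(lmm.summary())

# Tree-boosting combined with random effects; this assumes the GPBoost package,
# which implements boosting with grouped random effects (a toolchain assumption).
import gpboost as gpb
gp_model = gpb.GPModel(group_data=group, likelihood="gaussian")
train_set = gpb.Dataset(X, y)
params = {"learning_rate": 0.05, "max_depth": 3, "verbose": 0}
bst = gpb.train(params=params, train_set=train_set, gp_model=gp_model,
                num_boost_round=200)
gp_model.summary()  # estimated variance components (random intercepts and error term)

In both fits, the per-level effects are shrunk toward zero rather than estimated freely, which is why random effects can help when there are only a few observations per category level.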
