A Comparison of Machine Learning Methods for Data with High-Cardinality Categorical Variables (2307.02071v1)
Abstract: High-cardinality categorical variables are variables for which the number of different levels is large relative to the sample size of a data set; in other words, there are few data points per level. Machine learning methods can have difficulties with such variables. In this article, we empirically compare several versions of two of the most successful machine learning methods, tree-boosting and deep neural networks, as well as linear mixed effects models, on multiple tabular data sets with high-cardinality categorical variables. We find that, first, machine learning models with random effects have higher prediction accuracy than their classical counterparts without random effects, and, second, tree-boosting with random effects outperforms deep neural networks with random effects.
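The combination of tree-boosting with random effects compared in the abstract is implemented, for example, in the GPBoost library (Sigrist, 2022). Below is a minimal sketch of fitting such a model in Python, assuming the documented interface of the `gpboost` package (argument names and prediction keys follow its published examples and may differ across versions); the simulated data and parameter values are illustrative and not taken from the paper's experiments:

```python
import numpy as np
import gpboost as gpb

# Simulate data with one high-cardinality categorical variable:
# 500 levels for 5000 samples, i.e., roughly 10 observations per level.
rng = np.random.default_rng(0)
n, m = 5000, 500
group = rng.integers(0, m, size=n)       # categorical variable encoded as level IDs
b = 0.5 * rng.standard_normal(m)         # one random effect per level
X = rng.uniform(size=(n, 2))             # continuous fixed-effects predictors
y = X[:, 0] + np.sin(6 * X[:, 1]) + b[group] + 0.1 * rng.standard_normal(n)

# Random effects model for the grouped categorical variable
gp_model = gpb.GPModel(group_data=group, likelihood="gaussian")

# Tree-boosting for the fixed-effects function, estimated jointly with the
# random effects model
data_train = gpb.Dataset(data=X, label=y)
params = {"objective": "regression_l2", "learning_rate": 0.05,
          "max_depth": 3, "verbose": 0}
bst = gpb.train(params=params, train_set=data_train,
                gp_model=gp_model, num_boost_round=100)

# Predictions combine the tree ensemble (fixed effects) with the predicted
# random effects; a level unseen in training gets a random effect of zero.
X_test = rng.uniform(size=(4, 2))
group_test = np.array([0, 1, 2, m])      # level m did not occur in training
pred = bst.predict(data=X_test, group_data_pred=group_test, pred_latent=True)
y_pred = pred["fixed_effect"] + pred["random_effect_mean"]
```

A classical linear mixed effects model for the same data would replace the tree ensemble with a linear fixed-effects term; the random effects part, which shrinks per-level estimates toward zero when levels have few observations, is what both approaches share.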
- B. Avanzi, G. Taylor, M. Wang, and B. Wong. Machine Learning with High-Cardinality Categorical Features in Actuarial Applications. arXiv preprint arXiv:2301.12710, 2023.
- T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.
- W. D. Fisher. On grouping for maximum homogeneity. Journal of the American Statistical Association, 53(284):789–798, 1958.
- W. Fu and J. S. Simonoff. Unbiased regression trees for longitudinal and clustered data. Computational Statistics & Data Analysis, 88:53–74, 2015.
- L. Grinsztajn, E. Oyallon, and G. Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 507–520. Curran Associates, Inc., 2022.
- C. Guo and F. Berkhahn. Entity embeddings of categorical variables. arXiv preprint arXiv:1604.06737, 2016.
- A. Hajjem, F. Bellavance, and D. Larocque. Mixed effects regression trees for clustered data. Statistics & Probability Letters, 81(4):451–459, 2011.
- A. Hajjem, F. Bellavance, and D. Larocque. Mixed-effects random forest for clustered data. Journal of Statistical Computation and Simulation, 84(6):1313–1328, 2014.
- G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3149–3157, 2017.
- N. M. Laird and J. H. Ware. Random-effects models for longitudinal data. Biometrics, 38(4):963–974, 1982.
- J. Pinheiro and D. Bates. Mixed-effects models in S and S-PLUS. Springer Science & Business Media, 2006.
- L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin. CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, pages 6638–6648, 2018.
- R. J. Sela and J. S. Simonoff. RE-EM trees: a data mining approach for longitudinal and clustered data. Machine Learning, 86(2):169–207, 2012.
- R. Shwartz-Ziv and A. Armon. Tabular data: Deep learning is not all you need. Information Fusion, 81:84–90, 2022.
- F. Sigrist. Gaussian Process Boosting. The Journal of Machine Learning Research, 23(1):10565–10610, 2022.
- F. Sigrist. Latent Gaussian Model Boosting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):1894–1905, 2023.
- G. Simchoni and S. Rosset. Using random effects to account for high-cardinality categorical features and repeated measures in deep neural networks. Advances in Neural Information Processing Systems, 34:25111–25122, 2021.
- G. Simchoni and S. Rosset. Integrating Random Effects in Deep Neural Networks. Journal of Machine Learning Research, 24(156):1–57, 2023.