Reducing the dimensionality and granularity in hierarchical categorical variables (2403.03613v2)

Published 6 Mar 2024 in stat.ME and stat.ML

Abstract: Hierarchical categorical variables often exhibit many levels (high granularity) and many classes within each level (high dimensionality). This may cause overfitting and estimation issues when such covariates are included in a predictive model. In the current literature, a hierarchical covariate is often incorporated via nested random effects. However, this approach does not accommodate the assumption that classes share the same effect on the response variable. In this paper, we propose a methodology to obtain a reduced representation of a hierarchical categorical variable. We show how entity embedding can be applied in a hierarchical setting. Subsequently, we propose a top-down clustering algorithm that leverages the information encoded in the embeddings to reduce both the within-level dimensionality and the overall granularity of the hierarchical categorical variable. In simulation experiments, we show that our methodology can effectively approximate the true underlying structure of a hierarchical covariate in terms of its effect on a response variable, and we find that incorporating the reduced hierarchy improves the balance between model fit and complexity. We apply our methodology to a real dataset and find that the reduced hierarchy is an improvement over both the original hierarchical structure and the reduced structures proposed in the literature.
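
The idea of using embeddings to reduce within-level dimensionality can be illustrated with a small sketch. This is not the paper's algorithm: it swaps in a simple greedy threshold rule in place of the proposed top-down clustering, and it does not attempt the granularity reduction (collapsing levels). The function name `merge_similar_classes`, the `threshold` parameter, and the toy embedding values are all hypothetical, chosen only for illustration.

```python
from math import dist

def merge_similar_classes(embeddings, children, threshold):
    """Group sibling classes whose embedding vectors lie close together.

    embeddings: dict mapping class name -> tuple of floats
                (stand-ins for learned entity embeddings; values are made up)
    children:   dict mapping parent class -> list of child class names
    threshold:  maximum Euclidean distance to a group's first member
                for a class to join that group (illustrative rule only)
    """
    merged = {}
    # Work parent by parent, so only siblings can be merged.
    for parent, kids in children.items():
        groups = []
        for kid in kids:
            for group in groups:
                # Compare against the group's first member as its representative.
                if dist(embeddings[kid], embeddings[group[0]]) <= threshold:
                    group.append(kid)
                    break
            else:
                # No existing group is close enough: start a new one.
                groups.append([kid])
        merged[parent] = groups
    return merged

# Toy two-level hierarchy with hypothetical 2-D embeddings.
embeddings = {
    "A1": (0.0, 0.0), "A2": (0.1, 0.0), "A3": (2.0, 2.0),
    "B1": (5.0, 5.0), "B2": (5.0, 5.1),
}
children = {"A": ["A1", "A2", "A3"], "B": ["B1", "B2"]}

reduced = merge_similar_classes(embeddings, children, threshold=0.5)
print(reduced)
```

In this toy example, classes A1 and A2 end up in one group under parent A while A3 stays separate, so parent A has two classes instead of three. A full treatment would also decide when an entire level can be collapsed into its parent, which this sketch leaves out.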
