Naive Bayes Classifiers and One-hot Encoding of Categorical Variables (2404.18190v1)
Abstract: This paper investigates the consequences of incorrectly encoding a $K$-valued categorical variable as $K$ bits via one-hot encoding when using a Naïve Bayes classifier. This gives rise to a product-of-Bernoullis (PoB) assumption, rather than the correct categorical Naïve Bayes classifier. The differences between the two classifiers are analysed mathematically and experimentally. In our experiments using probability vectors drawn from a Dirichlet distribution, the two classifiers are found to agree on the maximum a posteriori class label in most cases, although the posterior probabilities are usually greater for the PoB case.
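The comparison the abstract describes can be sketched with scikit-learn, which provides both a categorical Naive Bayes classifier (`CategoricalNB`) and a product-of-Bernoullis model (`BernoulliNB`) that can be fit on the one-hot encoding of the same data. The sample sizes, number of features, and number of category values below are illustrative assumptions, not values taken from the paper:

```python
# Sketch: compare categorical Naive Bayes with a product-of-Bernoullis (PoB)
# model fit on a one-hot encoding of the same K-valued features.
# All dataset dimensions here are hypothetical, chosen only for illustration.
import numpy as np
from sklearn.naive_bayes import BernoulliNB, CategoricalNB
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n, d, K = 500, 3, 4                      # samples, features, values per feature
X = rng.integers(0, K, size=(n, d))      # K-valued categorical features
y = rng.integers(0, 2, size=n)           # binary class labels

# Correct model: categorical Naive Bayes on the raw K-valued features.
cat_nb = CategoricalNB().fit(X, y)

# Incorrect model: one-hot encode each feature into K bits, then treat the
# bits as independent Bernoulli variables (the PoB assumption).
X_onehot = OneHotEncoder().fit_transform(X)
pob_nb = BernoulliNB().fit(X_onehot, y)

# Fraction of points where the two classifiers agree on the MAP class label.
agreement = np.mean(cat_nb.predict(X) == pob_nb.predict(X_onehot))
print(f"MAP label agreement: {agreement:.3f}")
```

Comparing `cat_nb.predict_proba(X)` with `pob_nb.predict_proba(X_onehot)` on the same points shows how the PoB posteriors differ from the categorical ones; the paper's experiments draw the underlying probability vectors from a Dirichlet distribution rather than using uniform random data as above.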