Tabular Learning: Encoding for Entity and Context Embeddings (2403.19405v1)
Abstract: This work challenges the commonly used ordinal encoding for tabular learning by examining how different encoding techniques affect entity and context embeddings. Applying several preprocessing methods and network architectures across multiple datasets produced a benchmark of how the encoders influence the networks' learning outcomes. With the training, validation, and test data held consistent, the results show that ordinal encoding is not the best-suited encoder for categorical data, neither for preprocessing the data nor for subsequently classifying the target variable correctly. A better outcome was achieved by encoding the features based on string similarities, computing a similarity matrix that serves as input to the network. This holds for both entity and context embeddings, and the transformer architecture showed improved performance with ordinal and similarity encoding on multi-label classification tasks.
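The similarity encoding described above can be illustrated with a minimal sketch: each categorical value is represented by its vector of string similarities to the set of unique categories, yielding the similarity matrix fed to the network. The n-gram Jaccard similarity, the padding scheme, and the function names are illustrative assumptions, not the paper's exact procedure.

```python
def ngrams(s: str, n: int = 3) -> set:
    """Return the set of character n-grams of s, padded with spaces."""
    s = f" {s} "  # padding so that very short strings still yield n-grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity between the n-gram sets of two strings."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0

def similarity_encode(values: list) -> list:
    """Encode each value as its similarity vector against all unique categories.

    The resulting rows form the similarity matrix used as network input.
    """
    categories = sorted(set(values))
    return [[similarity(v, c) for c in categories] for v in values]

# Example: "cat" and "car" share the leading n-gram " ca", so they are
# encoded as close-but-distinct vectors rather than arbitrary integers.
matrix = similarity_encode(["cat", "car"])
```

Unlike ordinal encoding, which imposes an arbitrary order on categories, each row here carries graded information about how similar a value is to every other category string.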