Tabular Learning: Encoding for Entity and Context Embeddings (2403.19405v1)

Published 28 Mar 2024 in cs.LG, cs.AI, and cs.CE

Abstract: This work examines the effect of different encoding techniques on entity and context embeddings, with the goal of challenging the commonly used ordinal encoding for tabular learning. Applying different preprocessing methods and network architectures across several datasets yields a benchmark of how the encoders influence the learning outcome of the networks. With the training, validation, and test splits held fixed, the results show that ordinal encoding is not the best-suited encoder for preprocessing categorical data and subsequently classifying the target variable correctly. A better outcome was achieved by encoding the features based on string similarities, computing a similarity matrix that serves as input for the network. This holds for both entity and context embeddings, and the transformer architecture showed improved performance with both ordinal and similarity encoding on multi-label classification tasks.
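To make the similarity-matrix idea concrete, here is a minimal sketch of string-similarity encoding for a single categorical column. It assumes a 3-gram Jaccard overlap as the string metric; the abstract does not specify which metric the paper uses, and all function and variable names below are illustrative, not taken from the paper's code.

```python
# Sketch: similarity encoding for one categorical column.
# Assumption: 3-gram Jaccard overlap as the string-similarity measure
# (one common choice for similarity encoding; the paper's exact metric
# is not stated in the abstract).

def ngrams(s: str, n: int = 3) -> set[str]:
    """Character n-grams of a padded, lower-cased string."""
    s = f"##{s.lower()}##"
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def similarity(a: str, b: str) -> float:
    """Jaccard similarity between the 3-gram sets of two category labels."""
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb)

def similarity_encode(values: list[str],
                      categories: list[str]) -> list[list[float]]:
    """Encode each value as its similarity to every known category.

    Unlike ordinal encoding (one arbitrary integer per category), each
    value becomes a dense vector that preserves lexical closeness, so
    near-duplicate or misspelled labels land near their canonical form.
    """
    return [[similarity(v, c) for c in categories] for v in values]

if __name__ == "__main__":
    # Hypothetical category labels, for illustration only.
    cats = ["private", "self-emp", "government"]
    rows = ["private", "self-employed", "gov"]
    for row, vec in zip(rows, similarity_encode(rows, cats)):
        print(row, [round(x, 2) for x in vec])
```

Each row thus becomes a dense vector of length equal to the number of known categories, which an embedding layer or transformer input projection can consume directly, whereas ordinal codes impose an arbitrary ordering that the network must learn to ignore.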
