CARTE: Pretraining and Transfer for Tabular Learning (2402.16785v2)

Published 26 Feb 2024 in cs.LG

Abstract: Pretrained deep-learning models are the go-to solution for images or text. However, for tabular data the standard is still to train tree-based models. Indeed, transfer learning on tables runs into the challenge of data integration: finding correspondences in the entries (entity matching), where different words may denote the same entity, and correspondences across columns (schema matching), which may come in different orders or under different names. We propose a neural architecture that does not need such correspondences. As a result, we can pretrain it on background data that has not been matched. The architecture -- CARTE, for Context Aware Representation of Table Entries -- uses a graph representation of tabular (or relational) data to process tables with different columns, string embeddings of entries and column names to model an open vocabulary, and a graph-attentional network to contextualize entries with column names and neighboring entries. An extensive benchmark shows that CARTE facilitates learning, outperforming a solid set of baselines including the best tree-based models. CARTE also enables joint learning across tables with unmatched columns, enhancing a small table with bigger ones. CARTE opens the door to large pretrained models for tabular data.
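To make the graph representation described in the abstract concrete, below is a minimal sketch of how one table row could be turned into a small "graphlet": a center node connected to one node per cell, with cell values embedded as node features and column names embedded as edge features. The helper names (`embed_string`, `row_to_graphlet`), the hash-seeded embedder, and the mean-initialized center node are illustrative placeholders and design choices of this sketch, not the authors' implementation; CARTE itself uses pretrained language-model string embeddings and processes such graphlets with a graph-attentional network.

```python
import hashlib
import numpy as np

def embed_string(text: str, dim: int = 32) -> np.ndarray:
    """Placeholder string embedder: a hash-seeded random vector.
    Stands in for the pretrained fastText-style embeddings CARTE uses."""
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "little")
    return np.random.default_rng(seed).standard_normal(dim)

def row_to_graphlet(row: dict, dim: int = 32):
    """Turn one table row into a star-shaped graphlet.

    Each cell becomes a node whose feature is the embedding of its value
    (numeric cells scale the column-name embedding instead), and the edge
    to the center node carries the embedding of the column name."""
    node_feats, edge_feats = [], []
    for col, value in row.items():
        col_emb = embed_string(col, dim)
        if isinstance(value, (int, float)):
            node_feats.append(float(value) * col_emb)        # numeric entry
        else:
            node_feats.append(embed_string(str(value), dim))  # string entry
        edge_feats.append(col_emb)
    # Center node (index 0) initialized as the mean of the cell nodes.
    x = np.vstack([np.mean(node_feats, axis=0)] + node_feats)
    edges = np.array([[0, i + 1] for i in range(len(node_feats))])
    return x, edges, np.vstack(edge_feats)

row = {"name": "Château Margaux", "region": "Bordeaux", "vintage": 2015}
x, edges, e = row_to_graphlet(row)
print(x.shape, edges.shape, e.shape)  # (4, 32) (3, 2) (3, 32)
```

Because each graphlet carries its own column-name embeddings on the edges, two tables with different schemas land in the same representation space, which is what allows pretraining and joint learning without entity or schema matching.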

Authors (3)
  1. Myung Jun Kim (3 papers)
  2. Léo Grinsztajn (7 papers)
  3. Gaël Varoquaux (87 papers)
Citations (5)
