UniTabE: A Universal Pretraining Protocol for Tabular Foundation Model in Data Science (2307.09249v2)

Published 18 Jul 2023 in cs.LG, cs.AI, and cs.CL

Abstract: Recent advancements in NLP have witnessed the groundbreaking impact of pretrained models, yielding impressive outcomes across various tasks. This study seeks to extend the power of pretraining methodologies to facilitate prediction over tables in data science, a domain traditionally overlooked, yet inherently challenging due to the plethora of table schemas intrinsic to different tasks. The primary research questions underpinning this work revolve around the establishment of a universal pretraining protocol for tables with varied structures, the generalizability and transferability of learned knowledge across tasks, the adaptation to diverse downstream applications, and the incorporation of incremental columns over time. In response to these challenges, we introduce UniTabE, a straightforward yet effective method designed to process tables in a uniform manner, devoid of constraints imposed by specific table structures. UniTabE's core concept relies on representing each basic table element with a module, termed TabUnit. This is subsequently followed by a Transformer encoder to refine the representation. Moreover, our model is designed to facilitate pretraining and finetuning through the utilization of free-form prompts. In order to implement the pretraining phase, we curated an expansive tabular dataset comprising approximately 13B samples, meticulously gathered from the Kaggle platform. This research primarily centers on classification and regression tasks involving tabular data, and conducts rigorous experimental testing and analyses to validate the effectiveness of our methodology. The experimental results demonstrate UniTabE's superior performance against several baselines across massive benchmarks. This, therefore, underscores UniTabE's potential to significantly enhance the semantic representation of tabular data, thereby marking a significant stride for tabular data analysis.
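
The abstract's core architectural idea, a per-cell TabUnit module whose outputs are refined by a Transformer encoder, can be illustrated with a short sketch. Below is a minimal, hypothetical PyTorch rendering under assumed names and sizes (TabUnit, RowEncoder, d_model=256, a shared token vocabulary); it is not the authors' implementation and omits details such as numerical-feature handling and the free-form prompts mentioned in the abstract.

# Minimal sketch (not the paper's implementation) of the TabUnit-plus-encoder idea:
# each table cell is embedded by a shared per-element module conditioned on its
# column name, and a Transformer encoder then refines the row-level representation.
# All module names, dimensions, and the shared vocabulary are illustrative assumptions.
import torch
import torch.nn as nn


class TabUnit(nn.Module):
    """Embeds one table cell by fusing its column-name and value embeddings."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, col_name_ids: torch.Tensor, value_ids: torch.Tensor) -> torch.Tensor:
        # Average the token embeddings of the column name and of the cell value,
        # then fuse them into a single cell vector.
        col = self.token_emb(col_name_ids).mean(dim=-2)   # (batch, d_model)
        val = self.token_emb(value_ids).mean(dim=-2)      # (batch, d_model)
        return self.fuse(torch.cat([col, val], dim=-1))   # (batch, d_model)


class RowEncoder(nn.Module):
    """Applies a shared TabUnit to every cell of a row, then a Transformer encoder."""

    def __init__(self, vocab_size: int = 30000, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        self.tab_unit = TabUnit(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, col_ids: torch.Tensor, val_ids: torch.Tensor) -> torch.Tensor:
        # col_ids, val_ids: (batch, n_cells, n_tokens) tensors of token ids.
        b, n, t = col_ids.shape
        cells = self.tab_unit(col_ids.view(b * n, t), val_ids.view(b * n, t))
        cells = cells.view(b, n, -1)      # (batch, n_cells, d_model)
        return self.encoder(cells)        # refined per-cell representations


# Example: a batch of 2 rows with 5 columns, each name/value tokenized to length 4.
model = RowEncoder()
cols = torch.randint(0, 30000, (2, 5, 4))
vals = torch.randint(0, 30000, (2, 5, 4))
out = model(cols, vals)   # shape (2, 5, 256)

In this reading, every cell is reduced to one vector by the shared TabUnit regardless of which column it comes from, so the encoder sees a schema-agnostic sequence of cell vectors; this is one plausible way to realize the abstract's claim of processing tables with varied structures in a uniform manner.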

Authors (5)
  1. Yazheng Yang (16 papers)
  2. Yuqi Wang (62 papers)
  3. Guang Liu (30 papers)
  4. Ledell Wu (16 papers)
  5. Qi Liu (487 papers)
Citations (11)
