Towards Cross-Table Masked Pretraining for Web Data Mining (2307.04308v2)

Published 10 Jul 2023 in cs.LG

Abstract: Tabular data pervades the landscape of the World Wide Web, playing a foundational role in the digital architecture that underpins online information. Given the recent influence of large-scale pretrained models such as ChatGPT and SAM across various domains, applying pretraining techniques to mining tabular data on the web has emerged as a highly promising research direction. Indeed, there have been some recent works on this topic, but most (if not all) of them are limited to a fixed-schema, single-table setting. Given the limited dataset scale and parameter sizes of prior models, we believe the "BERT moment" for ubiquitous tabular data has not yet arrived; development along this line significantly lags behind counterpart research domains such as natural language processing. In this work, we first identify the crucial challenges behind tabular data pretraining, particularly the cross-table hurdle. As a pioneering endeavor, this work mainly (i) contributes a high-quality real-world tabular dataset, (ii) proposes an innovative, generic, and efficient cross-table pretraining framework, dubbed CM2, whose core is a semantic-aware tabular neural network that uniformly encodes heterogeneous tables with few restrictions, and (iii) introduces a novel pretraining objective, prompt Masked Table Modeling (pMTM), inspired by NLP but tailored to scalable pretraining on tables. Our extensive experiments demonstrate CM2's state-of-the-art performance and validate that cross-table pretraining can enhance various downstream tasks.
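
The abstract describes pMTM only at a high level, so the sketch below is a minimal, illustrative PyTorch toy of the general idea of masked-cell pretraining over schema-heterogeneous tables: each cell is embedded from its (discretized) value plus a text embedding of its column name, so tables with different schemas can share one encoder; a random subset of cells is masked and the model reconstructs them. The masking ratio, value vocabulary, and architecture here are assumptions for illustration, not the paper's actual CM2/pMTM implementation.

    import torch
    import torch.nn as nn

    class MaskedTableModel(nn.Module):
        # Toy transformer encoder over per-cell embeddings. Each cell is the
        # sum of a discretized value embedding and a column-name embedding,
        # so heterogeneous schemas map into one shared input space.
        def __init__(self, d_model=64, n_heads=4, n_layers=2, value_vocab=1000):
            super().__init__()
            self.value_emb = nn.Embedding(value_vocab, d_model)
            self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)
            self.head = nn.Linear(d_model, value_vocab)  # reconstruct masked cells

        def forward(self, value_ids, col_name_emb, mask):
            # value_ids: (B, n_cols) ints; col_name_emb: (B, n_cols, d_model);
            # mask: (B, n_cols) bool, True where the cell value is hidden.
            x = self.value_emb(value_ids) + col_name_emb
            x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
            return self.head(self.encoder(x))  # (B, n_cols, value_vocab)

    # One pretraining step: mask ~35% of cells at random and reconstruct them.
    B, n_cols, d = 8, 5, 64
    model = MaskedTableModel(d_model=d)
    value_ids = torch.randint(0, 1000, (B, n_cols))
    col_name_emb = torch.randn(B, n_cols, d)  # stand-in for a text encoder's output
    mask = torch.rand(B, n_cols) < 0.35
    logits = model(value_ids, col_name_emb, mask)
    loss = nn.functional.cross_entropy(logits[mask], value_ids[mask])
    loss.backward()

In the paper's setting, the column-name embeddings would come from a pretrained text encoder rather than random tensors; it is this semantic encoding of column names that lets one model consume tables with arbitrary schemas, which is what makes cross-table pretraining possible.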

