Why Tabular Foundation Models Should Be a Research Priority (2405.01147v2)
Abstract: Recent text and image foundation models are incredibly impressive, and these models are attracting an ever-increasing portion of research resources. In this position piece we aim to shift the ML research community's priorities ever so slightly toward a different modality: tabular data. Tabular data is the dominant modality in many fields, yet it receives hardly any research attention and significantly lags behind in scale and power. We believe the time is now to start developing tabular foundation models, or what we coin Large Tabular Models (LTMs). LTMs could revolutionize the way science and ML use tabular data: not as single datasets analyzed in a vacuum, but contextualized with respect to related datasets. The potential impact is far-reaching: from few-shot tabular models to automating data science; from out-of-distribution synthetic data to empowering multidisciplinary scientific discovery. We intend to stimulate reflection on the modalities we study, and to convince some researchers to study large tabular models.