Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding -- A Survey (2402.17944v4)

Published 27 Feb 2024 in cs.CL

Abstract: Recent breakthroughs in large language models have facilitated rigorous exploration of their application to diverse tasks related to tabular data modeling, such as prediction, tabular data synthesis, question answering, and table understanding. Each task presents unique challenges and opportunities. However, there is currently a lack of a comprehensive review that summarizes and compares the key techniques, metrics, datasets, models, and optimization approaches in this research domain. This survey aims to address this gap by consolidating recent progress in these areas, offering a thorough survey and taxonomy of the datasets, metrics, and methodologies utilized. It identifies strengths, limitations, unexplored territories, and gaps in the existing literature, while providing insights for future research directions in this vital and rapidly evolving field. It also provides references to relevant code and datasets. Through this comprehensive review, we hope to provide interested readers with pertinent references and insightful perspectives, empowering them with the tools and knowledge needed to navigate and address the prevailing challenges in the field.


Summary

  • The paper presents an expansive survey evaluating LLM techniques for tabular data prediction, synthesis, question answering, and table understanding.
  • It outlines key preprocessing methods such as serialization, prompt engineering, and table manipulation to adapt tabular data for LLM use.
  • The survey highlights challenges including bias, numerical representation issues, and the need for standard benchmarks while suggesting future research directions.

Leveraging LLMs for Tabular Data: A Comprehensive Survey

Overview

Although LLMs were designed primarily for natural language processing tasks, recent advancements have opened a promising avenue for modeling tabular data across various applications. This survey offers a detailed examination of how LLMs are applied to tabular data, focusing on prediction tasks, data synthesis, question answering (QA), and table understanding. By identifying techniques for preparing tabular data for LLMs, comparing methodologies, and highlighting challenges and future research directions, the survey serves as a foundational reference for researchers and practitioners in machine learning (ML) and AI.

Key Techniques for LLM Applications on Tabular Data

The survey outlines the key steps for making tabular data compatible with LLMs:

  • Serialization: Transforming tabular data into a text or embedding format so it can serve as LLM input (a minimal sketch follows this list).
  • Table Manipulation: Techniques to compact tables for efficient processing, including strategies for handling large datasets that exceed LLM context length limitations.
  • Prompt Engineering: Refining input prompts to guide LLMs towards generating accurate outputs by introducing examples, task descriptions, and iterative approaches for complex reasoning.
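
To make the serialization step concrete, here is a minimal sketch of the text-template approach that recurs throughout this literature ("The <column> is <value>." sentences). The dataset, column names, and prompt wording below are hypothetical, chosen only for illustration:

```python
def serialize_row(row: dict, target: str) -> str:
    """Turn one tabular record into a natural-language prompt using the
    'The <column> is <value>.' template; JSON or markdown-table
    serializations, also covered by the survey, would slot in the same way."""
    features = " ".join(f"The {col} is {val}." for col, val in row.items())
    return f"{features} What is the {target}?"

# Hypothetical record from a loan-approval dataset.
row = {"age": 42, "income": 58000, "credit score": 710}
print(serialize_row(row, target="loan decision"))
# -> The age is 42. The income is 58000. The credit score is 710.
#    What is the loan decision?
```

Which serialization format works best is itself an empirical question; the survey compares several such formats across tasks.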

Methodologies Across Different Tasks

For each focal area (prediction, data synthesis, QA, and table understanding), the paper conducts an extensive review of methodologies, highlighting the typical pipeline, evaluation metrics, and notable models employed. While LLMs have shown the capability to understand and generate tabular data, prompt engineering, in-context learning examples, and fine-tuning strategies prove crucial for strong performance on specific tasks.
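
As a concrete illustration of the in-context learning pattern, the sketch below assembles a few-shot prompt from serialized rows. The task, example rows, and labels are hypothetical, and the actual LLM call is left abstract since it depends on the provider:

```python
def build_few_shot_prompt(task: str,
                          examples: list[tuple[str, str]],
                          query: str) -> str:
    """Compose a few-shot prompt: task description, labeled in-context
    examples, then the unlabeled query row for the model to complete."""
    parts = [task]
    for serialized_row, label in examples:
        parts.append(f"{serialized_row}\nAnswer: {label}")
    parts.append(f"{query}\nAnswer:")  # the LLM fills in the final label
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    task="Predict whether the passenger survived (yes or no).",
    examples=[
        ("The age is 29. The fare is 211.34. The class is 1.", "yes"),
        ("The age is 35. The fare is 8.05. The class is 3.", "no"),
    ],
    query="The age is 4. The fare is 16.70. The class is 2.",
)
print(prompt)  # send to any chat/completions endpoint
```

Fine-tuning approaches covered in the survey (e.g., TabLLM) instead train on many such serialized examples rather than packing them into the prompt.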

Challenges and Limitations

Despite the progress, several limitations are emphasized:

  • Bias and Fairness: Social biases inherited from LLM training data pose a challenge for fairness in downstream applications.
  • Hallucination: LLMs' tendency to generate information not present in the input data leads to reliability concerns.
  • Numerical and Categorical Representation Issues: Effective encoding of numerical and categorical features remains a technical hurdle (see the tokenizer sketch after this list).
  • Lack of Standard Benchmarks: The field suffers from heterogeneity in datasets and metrics, calling for standardized benchmarks for fair comparison.
  • Interpretability and Accessibility: The black-box nature of LLMs complicates result interpretation, and the complexity of model fine-tuning limits accessibility for non-experts.
  • Fine-tuning Strategy and Model Grafting: Designing effective fine-tuning strategies and integrating model grafting techniques to include non-text data represent future research directions.
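
The numerical-representation hurdle is visible at the tokenizer level. The following sketch assumes the tiktoken library and its cl100k_base encoding (used by several OpenAI models); other BPE tokenizers fragment numbers differently but exhibit the same issue:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for value in ["7", "42", "1234", "1235", "1234.56"]:
    token_ids = enc.encode(value)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{value!r} -> {len(token_ids)} token(s): {pieces}")

# Nearby numbers can split into different subword pieces, so the model
# never sees a consistent digit-level structure on which to ground
# arithmetic or magnitude comparisons.
```

This is one reason the survey points to advanced tokenization and embedding techniques as a research direction (see the recommendations below).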

Recommendations and Speculations on Future Developments

The paper advocates exploring advanced tokenization and embedding techniques to better represent tabular data, developing unified benchmarks for fair comparison, and investigating model grafting to incorporate diverse data types seamlessly. Addressing bias, ensuring fairness, and enhancing interpretability are underscored as vital considerations for future research.

Conclusion

This comprehensive survey underscores the potential of LLMs to extend beyond natural language understanding and contribute significantly to tabular data modeling. By systematically evaluating methodologies, identifying challenges, and recommending research directions, it equips researchers and practitioners with the insights needed to apply LLMs at this intersection of AI and data science. As LLMs continue to evolve, their integration with tabular data applications could substantially reshape data analysis, prediction, and reasoning tasks across domains.
