
StructLM: Towards Building Generalist Models for Structured Knowledge Grounding (2402.16671v7)

Published 26 Feb 2024 in cs.CL

Abstract: Structured data sources, such as tables, graphs, and databases, are ubiquitous knowledge sources. Despite the demonstrated capabilities of LLMs on plain text, their proficiency in interpreting and utilizing structured data remains limited. Our investigation reveals a notable deficiency in LLMs' ability to process structured data, e.g., ChatGPT lags behind the state-of-the-art (SoTA) model by an average of 35%. To augment the Structured Knowledge Grounding (SKG) capabilities in LLMs, we have developed a comprehensive instruction tuning dataset comprising 1.1 million examples. Utilizing this dataset, we train a series of models, referred to as StructLM, based on the Mistral and the CodeLlama model family, ranging from 7B to 34B parameters. Our StructLM series surpasses task-specific models on 16 out of 18 evaluated datasets and establishes new SoTA performance on 8 SKG tasks. Furthermore, StructLM demonstrates strong generalization across 6 novel held-out SKG tasks, outperforming TableLlama by an average of 35% and Flan-UL2 20B by an average of 10%. Contrary to expectations, we observe that scaling model size offers marginal benefits, with StructLM-34B showing only slight improvements over StructLM-7B. This suggests that structured knowledge grounding is still a challenging task and requires more innovative design to push to a new level.

Authors (10)
  1. Alex Zhuang
  2. Ge Zhang
  3. Tianyu Zheng
  4. Xinrun Du
  5. Junjie Wang
  6. Weiming Ren
  7. Stephen W. Huang
  8. Jie Fu
  9. Xiang Yue
  10. Wenhu Chen
Citations (10)

Summary

StructLM: Building Generalist Models for Structured Knowledge Grounding

The paper "StructLM: Towards Building Generalist Models for Structured Knowledge Grounding" presents an approach to enhancing LLMs so that they can effectively process structured data sources such as tables, graphs, and databases. Despite their proficiency with unstructured text, LLMs show significant limitations on structured inputs. The researchers identify a marked deficiency in LLMs' ability to handle structured data; their analysis demonstrates, for example, that ChatGPT underperforms state-of-the-art (SoTA) models by 35% on average.

Main Contributions

The authors aimed to improve LLMs' Structured Knowledge Grounding (SKG) abilities by designing an extensive instruction tuning dataset encompassing 1.1 million examples. Utilizing this dataset, they trained a series of models, collectively named StructLM, based on the Mistral and CodeLlama model families, with parameters ranging from 7B to 34B. Remarkably, the StructLM models surpassed task-specific models on 16 of the 18 evaluated datasets, achieved new SoTA results on 8 SKG tasks, and displayed strong generalization to novel held-out tasks. Notably, the results revealed that mere scaling of model size offered marginal gains: StructLM-34B showed only slight improvements over StructLM-7B, suggesting that structured knowledge grounding remains a challenging domain requiring more innovative approaches.
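Instruction tuning on SKG tasks requires serializing structured inputs (e.g., tables) into flat text alongside an instruction and a query. The sketch below is a hypothetical illustration of this kind of linearization; the exact prompt format StructLM uses may differ.

```python
def linearize_table(headers, rows):
    """Flatten a table into a single string of the form
    'col : h1 | h2 row 1 : c1 | c2 ...'.

    Hypothetical serialization for illustration only; not the
    paper's exact format.
    """
    header_str = " | ".join(headers)
    row_strs = [" | ".join(str(c) for c in row) for row in rows]
    return "col : " + header_str + " " + " ".join(
        f"row {i + 1} : {r}" for i, r in enumerate(row_strs)
    )

def build_example(instruction, table, question):
    """Assemble one instruction-tuning prompt from its three parts."""
    headers, rows = table
    return (
        f"{instruction}\n\n"
        f"{linearize_table(headers, rows)}\n\n"
        f"Question: {question}"
    )

prompt = build_example(
    "Answer the question using the table below.",
    (["Country", "Capital"], [["France", "Paris"], ["Japan", "Tokyo"]]),
    "What is the capital of Japan?",
)
print(prompt)
```

The same pattern extends to graphs and database schemas, each with its own flattening convention, which is what lets a single instruction-tuned model cover many SKG tasks.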

Evaluation and Results

The StructLM models were evaluated against prominent baselines such as GPT-3.5-Turbo and task-specific models. The StructLM series not only exceeded SoTA results on several tasks but also offered a parameter-efficient solution. Whereas general-purpose LLMs like ChatGPT performed poorly on these tasks, StructLM's performance highlights the benefit of focused instruction tuning on structured tasks. The findings also showed improved cross-task generalization when training on the mixed multi-task dataset, compared to single-task models.
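Headline margins such as "outperforming TableLlama by an average of 35%" are macro averages of per-task score differences. A minimal sketch of that aggregation, using made-up task names and scores (not the paper's actual numbers):

```python
def average_gap(model_scores, baseline_scores):
    """Mean percentage-point gap of a model over a baseline,
    averaged across the tasks both were evaluated on."""
    shared = sorted(model_scores.keys() & baseline_scores.keys())
    gaps = [model_scores[t] - baseline_scores[t] for t in shared]
    return sum(gaps) / len(gaps)

# Illustrative held-out task scores (invented for this sketch).
structlm = {"task_a": 62.0, "task_b": 48.5, "task_c": 71.0}
baseline = {"task_a": 30.0, "task_b": 20.5, "task_c": 26.0}
print(average_gap(structlm, baseline))  # prints 35.0
```

Macro averaging weights every task equally regardless of its test-set size, which is the usual convention when reporting generalization across heterogeneous benchmarks.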

Ablation Studies

Further analysis was conducted to examine the effects of pretraining data types and the role of general instruction data. Code-pretrained models showed an edge in performance across diverse SKG tasks. The inclusion of general instruction data was found to significantly enhance zero-shot performance on held-out tasks, reducing overfitting to specific training formats.
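The ablation on general instruction data amounts to mixing a slice of general-purpose examples into the SKG training set. The following is a hypothetical sketch of such a mixing scheme; the ratio and sampling strategy are assumptions, not values from the paper.

```python
import random

def mix_datasets(skg_examples, general_examples, general_ratio=0.2, seed=0):
    """Combine SKG and general instruction data so that roughly
    `general_ratio` of the final mixture is general data.

    Hypothetical scheme for illustration; the paper's actual
    mixing ratio is not reproduced here.
    """
    rng = random.Random(seed)
    # Solve n / (len(skg) + n) = general_ratio for n.
    n_general = int(len(skg_examples) * general_ratio / (1 - general_ratio))
    sampled = rng.sample(general_examples, min(n_general, len(general_examples)))
    mixed = list(skg_examples) + sampled
    rng.shuffle(mixed)
    return mixed
```

Shuffling the combined pool interleaves the two sources within each epoch, which is what discourages the model from overfitting to a single output format.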

Implications and Future Directions

The implications of this research stretch across both practical and theoretical domains. Practically, StructLM can enhance automation capabilities in applications involving databases and knowledge graphs, potentially streamlining question-answering, summarization, and fact verification. Theoretically, the findings suggest that specialized pretraining, such as on structured data formats, could prove worthwhile.

The paper identifies critical areas for further exploration, such as developing more diverse structured data representations during pretraining and employing constrained LLM evaluation methods. These directions point toward broadening the capabilities of LLMs in processing structured data and establishing SKG as a foundational capability.

The research represents a significant stride toward addressing the structured knowledge grounding challenge and establishes a robust baseline for future advances in LLM capabilities.