Battle of the Large Language Models: Dolly vs LLaMA vs Vicuna vs Guanaco vs Bard vs ChatGPT -- A Text-to-SQL Parsing Comparison (2310.10190v1)

Published 16 Oct 2023 in cs.CL and cs.AI

Abstract: The success of ChatGPT has ignited an AI race, with researchers striving to develop new LLMs that can match or surpass the language understanding and generation abilities of commercial ones. In recent times, a number of models have emerged, claiming performance near that of GPT-3.5 or GPT-4 through various instruction-tuning methods. As practitioners of Text-to-SQL parsing, we are grateful for their valuable contributions to open-source research. However, it is important to approach these claims with a sense of scrutiny and ascertain the actual effectiveness of these models. Therefore, we pit six popular LLMs against each other, systematically evaluating their Text-to-SQL parsing capability on nine benchmark datasets with five different prompting strategies, covering both zero-shot and few-shot scenarios. Regrettably, the open-sourced models fell significantly short of the performance achieved by closed-source models like GPT-3.5, highlighting the need for further work to bridge the performance gap between these models.
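The paper's evaluation harness is not reproduced on this page. As a rough illustration of what the described setup can look like, the Python sketch below builds zero-shot and few-shot Text-to-SQL prompts and scores model predictions by execution match against a SQLite database. This is a minimal sketch under stated assumptions: the prompt templates, the demonstration format, and the `call_model` callable are placeholders, not the authors' code.

```python
# Minimal sketch, not the paper's actual harness: zero-shot vs. few-shot
# prompt construction for Text-to-SQL, scored by execution match on SQLite.
# `call_model` is a hypothetical stand-in for whichever LLM API is queried.
import sqlite3
from collections import Counter
from typing import Callable, Sequence, Tuple

def zero_shot_prompt(schema: str, question: str) -> str:
    """Schema and question only, no demonstrations."""
    return f"### SQLite schema\n{schema}\n### Question\n{question}\n### SQL\n"

def few_shot_prompt(schema: str, question: str,
                    demos: Sequence[Tuple[str, str]]) -> str:
    """Prepend (question, SQL) demonstrations before the test question."""
    shots = "".join(f"### Question\n{q}\n### SQL\n{s}\n" for q, s in demos)
    return f"### SQLite schema\n{schema}\n{shots}### Question\n{question}\n### SQL\n"

def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    """True if predicted and gold SQL return the same multiset of rows."""
    with sqlite3.connect(db_path) as conn:
        try:
            pred = conn.execute(pred_sql).fetchall()
        except sqlite3.Error:  # invalid predicted SQL counts as a miss
            return False
        gold = conn.execute(gold_sql).fetchall()
    return Counter(map(repr, pred)) == Counter(map(repr, gold))

def evaluate(call_model: Callable[[str], str],
             examples: Sequence[Tuple[str, str]],
             db_path: str, schema: str,
             demos: Sequence[Tuple[str, str]] = ()) -> float:
    """Fraction of (question, gold SQL) pairs answered correctly on execution."""
    hits = 0
    for question, gold_sql in examples:
        prompt = (few_shot_prompt(schema, question, demos) if demos
                  else zero_shot_prompt(schema, question))
        hits += execution_match(db_path, call_model(prompt), gold_sql)
    return hits / len(examples)
```

Execution match is one of the standard Text-to-SQL metrics (exact-set match of SQL components being the other common one); the sketch uses it only to make the zero-shot versus few-shot comparison concrete.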

Authors (7)
  1. Shuo Sun (91 papers)
  2. Yuchen Zhang (112 papers)
  3. Jiahuan Yan (16 papers)
  4. Yuze Gao (4 papers)
  5. Donovan Ong (3 papers)
  6. Bin Chen (546 papers)
  7. Jian Su (18 papers)
Citations (8)