
ChatBI: Towards Natural Language to Complex Business Intelligence SQL (2405.00527v1)

Published 1 May 2024 in cs.DB

Abstract: Natural Language to SQL (NL2SQL) technology gives non-expert users who are unfamiliar with databases the opportunity to use SQL for data analysis. Converting Natural Language to Business Intelligence (NL2BI) queries is a popular practical application of NL2SQL in production systems, and it introduces more challenges than NL2SQL. In this paper, we propose ChatBI, a comprehensive and efficient technology for the NL2BI task. First, we analyze the interaction mode, an important module in which NL2BI usage differs from NL2SQL, and design a smaller, cheaper model to match this interaction mode. In BI scenarios, tables contain a huge number of columns, so existing NL2SQL methods that rely on LLMs for schema linking cannot proceed due to token limitations; the higher proportion of ambiguous columns in BI scenarios further complicates schema linking. ChatBI leverages existing view technology from the database community to first decompose the schema linking problem into a Single View Selection problem, then uses a smaller, cheaper machine learning model to select the single view, which has a significantly reduced number of columns. That view's columns are then passed to the LLM as the candidate columns for schema linking. Finally, ChatBI proposes a phased process flow, different from existing flows, which allows it to generate SQL containing complex semantics and comparison relations more accurately. We have deployed ChatBI on Baidu's data platform and integrated it into multiple product lines for large-scale evaluation on production tasks. The results highlight its practicality, versatility, and efficiency; compared with current mainstream NL2SQL techniques on our real BI tables and queries, it also achieves the best results.
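The abstract outlines a three-stage pipeline: a small model first selects a single view, the LLM then performs schema linking over only that view's columns, and a phased flow produces the final SQL. The Python below is a minimal sketch of that decomposition, not ChatBI's actual implementation; the TF-IDF view selector stands in for the paper's "smaller and cheaper" ML model, and the function names and the `llm` callable are illustrative assumptions.

```python
# Hypothetical sketch of the NL2BI pipeline described in the abstract,
# NOT ChatBI's actual code. `llm` is an assumed text-in/text-out callable.
from dataclasses import dataclass
from typing import Callable

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


@dataclass
class View:
    name: str
    columns: list[str]  # a view exposes far fewer columns than the raw tables


def select_view(question: str, views: list[View]) -> View:
    """Stage 1 (Single View Selection): pick the one most relevant view with
    a cheap model, so the LLM never sees the full wide schema."""
    docs = [" ".join([v.name, *v.columns]) for v in views]
    vectorizer = TfidfVectorizer().fit(docs + [question])
    scores = cosine_similarity(
        vectorizer.transform([question]), vectorizer.transform(docs)
    )[0]
    return views[int(scores.argmax())]


def link_schema(question: str, view: View, llm: Callable[[str], str]) -> str:
    """Stage 2 (schema linking): only the selected view's columns go into
    the prompt, keeping it within the LLM's token limit."""
    return llm(
        f"Question: {question}\n"
        f"Candidate columns of view {view.name}: {', '.join(view.columns)}\n"
        "List the columns needed to answer the question."
    )


def generate_sql(question: str, view: View, columns: str,
                 llm: Callable[[str], str]) -> str:
    """Stage 3 (phased flow): draft a SQL skeleton first, then refine the
    complex semantics and comparison relations in a second pass."""
    skeleton = llm(f"Write a SQL skeleton over {view.name}({columns}) "
                   f"for: {question}")
    return llm(f"Refine this SQL, making comparisons and filters explicit:\n"
               f"{skeleton}")
```

Under these assumptions, restricting the prompt to one view's columns is what keeps schema linking feasible when the underlying BI tables have far more columns than fit in an LLM context.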

Authors (9)
  1. Jinqing Lian (2 papers)
  2. Xinyi Liu (58 papers)
  3. Yingxia Shao (54 papers)
  4. Yang Dong (28 papers)
  5. Ming Wang (59 papers)
  6. Zhang Wei (14 papers)
  7. Tianqi Wan (8 papers)
  8. Ming Dong (38 papers)
  9. Hailin Yan (1 paper)
Citations (1)
