On Repairing Natural Language to SQL Queries (2310.03866v1)
Abstract: Data analysts use SQL queries to access and manipulate data on their databases. However, these queries are often challenging to write, and small mistakes can lead to unexpected data output. Recent work has explored several ways to automatically synthesize queries based on a user-provided specification. One promising technique called text-to-SQL consists of the user providing a natural language description of the intended behavior and the database's schema. Even though text-to-SQL tools are becoming more accurate, there are still many instances where they fail to produce the correct query. In this paper, we analyze when text-to-SQL tools fail to return the correct query and show that it is often the case that the returned query is close to a correct query. We propose to repair these failing queries using a mutation-based approach that is agnostic to the text-to-SQL tool being used. We evaluate our approach on two recent text-to-SQL tools, RAT-SQL and SmBoP, and show that our approach can repair a significant number of failing queries.
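To make the idea of mutation-based repair concrete, here is a minimal Python sketch. It is not the paper's algorithm: the abstract does not describe the mutation operators or how a candidate is judged correct, so everything below is an assumption for illustration. In particular, the sketch assumes an expected result table is available as an oracle, that only single-token mutations (one column name or one comparison operator) are tried, and that queries are whitespace-tokenized. The names `run`, `mutants`, and `repair`, and the `singer` table, are hypothetical.

```python
import sqlite3

OPERATORS = ("=", "<", ">", "<=", ">=", "!=")

def run(conn, sql):
    """Return the query's rows in a canonical order, or None if the query is invalid."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None

def mutants(sql, columns):
    """Yield queries that differ from `sql` in exactly one column name or operator."""
    tokens = sql.split()
    for i, tok in enumerate(tokens):
        if tok in columns:
            pool = columns
        elif tok in OPERATORS:
            pool = OPERATORS
        else:
            continue
        for alt in pool:
            if alt != tok:
                yield " ".join(tokens[:i] + [alt] + tokens[i + 1:])

def repair(conn, sql, columns, expected):
    """Return the first query (original or mutant) whose output matches `expected`."""
    target = sorted(expected, key=repr)
    if run(conn, sql) == target:
        return sql  # already correct, nothing to repair
    for candidate in mutants(sql, columns):
        if run(conn, candidate) == target:
            return candidate
    return None  # no single-token mutation fixes the query

# Toy database and a "failing" query that uses the wrong comparison operator.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?, ?)",
    [("Ana", 30, "PT"), ("Bo", 45, "US"), ("Cy", 52, "PT")],
)
failing = "SELECT name FROM singer WHERE age < 40"
expected = [("Bo",), ("Cy",)]  # assumed oracle: singers older than 40
fixed = repair(conn, failing, ["name", "age", "country"], expected)
print(fixed)
```

Because the repair loop only looks at the query text and its output, it does not depend on which text-to-SQL tool produced the failing query, which is the sense in which such an approach can be tool-agnostic.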
Authors:
- Aidan Z. H. Yang
- Ricardo Brancas
- Pedro Esteves
- Sofia Aparicio
- Joao Pedro Nadkarni
- Miguel Terra-Neves
- Vasco Manquinho
- Ruben Martins