- The paper introduces a multi-stage pipeline leveraging agentic LLMs for dynamic NL-to-SQL translation, significantly improving tabular question answering.
- It employs embedding-based example selection, schema-informed prompts, and Chain-of-Thought reasoning to refine and verify SQL queries.
- Experimental results on SemEval 2025 tasks show accuracy improvements from around 27% to over 70%, demonstrating the system's robustness.
Agentic LLMs for Question Answering over Tabular Data
The paper "Agentic LLMs for Question Answering over Tabular Data" (2509.09234) presents a novel approach that uses LLMs as agentic systems for natural language to SQL (NL-to-SQL) translation, addressing the challenges of Question Answering (QA) over tabular data. Given the complexity of structured queries over diverse real-world tables, the research introduces a multi-stage pipeline, leveraging models such as GPT-4o, GPT-4o-mini, and DeepSeek v2:16b, to dynamically generate and refine SQL queries.
Introduction to Table QA
Table QA is pivotal within NLP because it lets users interact with databases without SQL proficiency. The domain is challenging because it requires understanding complex schemas and generating precise logical forms. Earlier solutions, including rule-based systems and retrieval-augmented generation (RAG), were limited in scalability and in their ability to handle extensive tabular relationships. LLMs enable the dynamic derivation of SQL queries without predefined frameworks, although issues such as structural errors in the generated SQL remain.
Methodology
The methodology employs a multi-stage NL-to-SQL pipeline, comprising:
- Example Selection: Uses embedding-based similarity to choose relevant question-query pairs as context, thereby enhancing SQL query generation.
- SQL Query Generation: Structured prompts crafted with schema context guide LLMs to produce accurate SQL statements, leveraging an initial retrieval step to extract pertinent table rows rather than exact answers.
- Answer Extraction and Formatting: Employs Chain-of-Thought (CoT) prompting for logical reasoning, refining the extraction and formatting of answers from tabular SQL outputs.
- Verification: Classifies responses by format validity and relevance, flagging erroneous outputs rather than passing them through unchecked.
- Reprocessing: Iterative refinement is adopted for flagged responses, improving extraction accuracy and alignment with expected output types.
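The example-selection stage can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy hashed bag-of-words `embed` function stands in for a real embedding model, and the sample `bank` of question-SQL pairs is hypothetical.

```python
import hashlib
import math

def embed(text, dim=64):
    """Toy hashed bag-of-words embedding -- a stand-in for a real
    embedding model. Each token is hashed into one of `dim` buckets,
    and the count vector is L2-normalized."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        bucket = int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def cosine(a, b):
    """Cosine similarity of two unit vectors is just their dot product."""
    return sum(x * y for x, y in zip(a, b))

def select_examples(question, example_bank, k=2):
    """Return the k question-SQL pairs most similar to `question`,
    to be placed in the LLM prompt as in-context examples."""
    q = embed(question)
    ranked = sorted(example_bank,
                    key=lambda ex: cosine(embed(ex["question"]), q),
                    reverse=True)
    return ranked[:k]

# Hypothetical example bank; in the paper, pairs come from the task data.
bank = [
    {"question": "How many rows have status active?",
     "sql": "SELECT COUNT(*) FROM t WHERE status = 'active'"},
    {"question": "What is the average price per category?",
     "sql": "SELECT category, AVG(price) FROM t GROUP BY category"},
    {"question": "List the distinct shipping countries",
     "sql": "SELECT DISTINCT country FROM t"},
]

examples = select_examples("How many rows have status closed?", bank, k=1)
```

Because the new question shares most of its tokens with the first stored pair, that pair ranks highest and its SQL serves as a template in the generation prompt; a production system would swap the toy embedding for a learned one.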
Experimental Results
The proposed system was evaluated on SemEval 2025's DataBench and DataBench-Lite tasks, achieving accuracies of 70.5% and 71.6%, respectively, against a baseline of around 26-27%. This success is attributed to embedding-based examples enhancing query contextualization and CoT reasoning improving answer accuracy. GPT-4o achieved the highest accuracy among the tested models, owing to its stronger handling of complex queries.
Conclusion
The paper advances LLM-driven NL-to-SQL translation with a robust multi-stage pipeline, delivering substantial gains in query accuracy and reliability. Future work outlined in the paper targets the identified error categories, such as mishandled complex numerical reasoning and categorical misclassification, through advances in LLM architectures and refined NL-to-SQL frameworks. The research suggests both theoretical and practical implications for automating database interactions and improving structured-data accessibility.