- The paper introduces a multi-stage pipeline leveraging agentic LLMs for dynamic NL-to-SQL translation, significantly improving tabular question answering.
- It employs embedding-based example selection, schema-informed prompts, and Chain-of-Thought reasoning to refine and verify SQL queries.
- Experimental results on SemEval 2025 tasks show accuracy improvements from around 27% to over 70%, demonstrating the system's robustness.
Agentic LLMs for Question Answering over Tabular Data
The paper "Agentic LLMs for Question Answering over Tabular Data" (2509.09234) presents a novel approach that uses LLMs as agentic systems for natural language to SQL (NL-to-SQL) translation, addressing the challenges of Question Answering (QA) over tabular data. Given the complexity of structured queries over diverse real-world tables, the research introduces a multi-stage pipeline, leveraging models such as GPT-4o, GPT-4o-mini, and DeepSeek v2:16b, to dynamically generate and refine SQL queries.
Introduction to Table QA
Table QA is pivotal within NLP because it lets users interact with databases without SQL proficiency. The domain is challenging because it requires understanding complex schemas and generating precise logical forms. Earlier solutions, including rule-based systems and retrieval-augmented generation (RAG), were limited in scalability and in their ability to handle extensive tabular relationships. LLMs enable the dynamic derivation of SQL queries without predefined frameworks, although issues such as structural errors in the generated SQL remain.
Methodology
The methodology employs a multi-stage NL-to-SQL pipeline, comprising:
- Example Selection: Uses embedding-based similarity to choose relevant question-query pairs as context, thereby enhancing SQL query generation.
- SQL Query Generation: Structured prompts crafted with schema context guide LLMs to produce accurate SQL statements, leveraging an initial retrieval step to extract pertinent table rows rather than exact answers.
- Answer Extraction and Formatting: Employs Chain-of-Thought (CoT) prompting for logical reasoning, refining the extraction and formatting of answers from tabular SQL outputs.
- Verification: Classifies responses by format validity and relevance, flagging erroneous outputs rather than passing them through unchecked.
- Reprocessing: Iterative refinement is adopted for flagged responses, improving extraction accuracy and alignment with expected output types.
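The example-selection stage can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy hashed bag-of-words `embed` function stands in for a real embedding model, and the sample `bank` of question-SQL pairs is hypothetical.

```python
import hashlib
import math

def embed(text, dim=64):
    """Toy hashed bag-of-words embedding -- a stand-in for a real
    embedding model. Each token is hashed into one of `dim` buckets,
    and the count vector is L2-normalized."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        bucket = int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def cosine(a, b):
    """Cosine similarity of two unit vectors is just their dot product."""
    return sum(x * y for x, y in zip(a, b))

def select_examples(question, example_bank, k=2):
    """Return the k question-SQL pairs most similar to `question`,
    to be placed in the LLM prompt as in-context examples."""
    q = embed(question)
    ranked = sorted(example_bank,
                    key=lambda ex: cosine(embed(ex["question"]), q),
                    reverse=True)
    return ranked[:k]

# Hypothetical example bank; in the paper, pairs come from the task data.
bank = [
    {"question": "How many rows have status active?",
     "sql": "SELECT COUNT(*) FROM t WHERE status = 'active'"},
    {"question": "What is the average price per category?",
     "sql": "SELECT category, AVG(price) FROM t GROUP BY category"},
    {"question": "List the distinct shipping countries",
     "sql": "SELECT DISTINCT country FROM t"},
]

examples = select_examples("How many rows have status closed?", bank, k=1)
```

Because the new question shares most of its tokens with the first stored pair, that pair ranks highest and its SQL serves as a template in the generation prompt; a production system would swap the toy embedding for a learned one.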
Experimental Results
The proposed system was evaluated on SemEval 2025's DataBench and DataBench-Lite tasks, achieving accuracies of 70.5% and 71.6%, respectively, against a baseline of around 26-27%. This success is attributed to embedding-based examples enhancing query contextualization and CoT reasoning improving answer accuracy. GPT-4o achieved the highest accuracy among the tested models, owing to its stronger handling of complex queries.
Conclusion
The paper advances LLM-driven NL-to-SQL translation with a robust multi-stage pipeline, delivering substantial gains in query accuracy and reliability. Future work outlined in the paper targets the identified error categories, such as mishandled complex numerical reasoning and categorical misclassification, through advances in LLM architectures and refined NL-to-SQL frameworks. The research suggests both theoretical and practical implications for automating database interactions and improving structured-data accessibility.