Querying Structured Data Through Natural Language Using Language Models

Published 3 Apr 2026 in cs.CL and cs.AI | (2604.03057v1)

Abstract: This paper presents an open-source methodology for allowing users to query structured, non-textual datasets through natural language. Unlike Retrieval-Augmented Generation (RAG), which struggles with numerical and highly structured information, our approach trains an LLM to generate executable queries. To support this capability, we introduce a principled pipeline for synthetic training data generation, producing diverse question-answer pairs that capture both user intent and the semantics of the underlying dataset. We fine-tune a compact model, DeepSeek R1-Distill-8B, using QLoRA with 4-bit quantization, making the system suitable for deployment on commodity hardware. We evaluate our approach on a dataset describing accessibility to essential services across Durangaldea, Spain. The fine-tuned model achieves high accuracy across monolingual, multilingual, and unseen-location scenarios, demonstrating both robust generalization and reliable query generation. Our results highlight that small, domain-specific models can achieve high precision for this task without relying on large proprietary LLMs, making this methodology suitable for resource-constrained environments and adaptable to broader multi-dataset systems.


Authors (4)

Summary

The paper’s main contribution is a query execution pipeline that transforms natural language prompts into executable structured queries using a compact LLM.
It leverages a synthetic data pipeline with automated paraphrasing and 44,849 question–answer pairs to robustly capture diverse querying intentions.
The system uses 4-bit quantization and QLoRA fine-tuning on commodity hardware, achieving accuracy competitive with models such as GPT-4.

Structured Data Querying via Natural Language with Small LLMs

Model Motivation and System Architecture

The paper addresses a critical challenge in deploying LLM-based querying systems for structured, non-textual datasets: the inadequacy of RAG techniques for numerical and tabular data. The proposed open-source methodology uses a compact LLM—specifically DeepSeek R1-Distill-8B—fine-tuned via QLoRA with 4-bit quantization, to generate executable queries in response to natural language prompts. The architecture is designed to operate with commodity hardware, enabling practical deployment in resource-limited environments while eschewing large, closed-source models.

A key architectural innovation is the shift from direct answer generation or text retrieval to an explicit query execution pipeline. Upon receiving a user question, the model generates a structured query reflecting both the intent and the semantics of the relevant dataset fragment. The query is then executed, and the returned data is incorporated into the final natural language output, ensuring robust and precise access to complex, structured information.
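The generate, execute, and answer flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the dictionary-based query format, the `generate_query` stub that stands in for the fine-tuned model, and the toy `SERVICES` table with its values are all hypothetical.

```python
# Toy stand-in for the structured dataset (hypothetical values).
SERVICES = [
    {"location": "Durango", "service": "pharmacy", "walk_minutes": 6},
    {"location": "Durango", "service": "school", "walk_minutes": 11},
    {"location": "Elorrio", "service": "pharmacy", "walk_minutes": 9},
]


def generate_query(question: str) -> dict:
    """Stub for the fine-tuned LLM: maps a question to a structured query.

    A real system would call the model here; keyword matching is used only
    so that the example runs end to end.
    """
    text = question.lower()
    location = next((r["location"] for r in SERVICES if r["location"].lower() in text), None)
    service = next((r["service"] for r in SERVICES if r["service"] in text), None)
    if location is None or service is None:
        raise ValueError("could not derive a query from the question")
    return {"select": "walk_minutes", "where": {"location": location, "service": service}}


def execute_query(query: dict) -> list:
    """Run the structured query against the dataset and return the selected values."""
    rows = [r for r in SERVICES if all(r[k] == v for k, v in query["where"].items())]
    return [r[query["select"]] for r in rows]


def answer(question: str, model=generate_query) -> str:
    """Generate a query, execute it, and wrap the result in a sentence."""
    query = model(question)
    values = execute_query(query)
    if not values:
        return "No matching data found."
    where = query["where"]
    return (f"The nearest {where['service']} in {where['location']} "
            f"is about {values[0]} minutes away on foot.")


print(answer("How far is the nearest pharmacy in Elorrio?"))
```

The point of the design is that the numeric value in the final sentence comes from the dataset lookup, never from the model's own text generation.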

Figure 1: Overview of the model inference and query generation process.

Synthetic Training Pipeline and Dataset Construction

Sufficient, high-quality training data for query generation is typically unavailable for specialized domains. The paper presents a systematic pipeline for synthetic dataset creation:

All semantically meaningful projections over the dataset are identified.
For each projection, question templates are constructed, capturing diverse querying intentions and linguistic expressions.
Instantiation is performed over a representative sample of locations from the Durangaldea region using the Overpass API, producing 44,849 question–answer pairs.
Answers are paired with corresponding executable queries.
Paraphrasing is automated using LLMs (Gemini 2.5 Pro and DeepSeek R1) to induce robust generalization to linguistic variability.

The dataset covers a range of spatial and access-related queries, ensuring strong coverage for downstream query generation.
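The template instantiation steps above can be sketched as follows. The projection names, templates, locations, services, and query format are all hypothetical placeholders, and the LLM paraphrasing step is omitted; the sketch only shows how crossing templates with sampled values yields question and query pairs.

```python
from itertools import product

# Hypothetical projections: each queried field has its own question templates.
TEMPLATES = {
    "walk_minutes": [
        "How long does it take to walk to the nearest {service} from {location}?",
        "Walking time to the closest {service} in {location}?",
    ],
    "distance_m": [
        "How far is the nearest {service} from {location}?",
    ],
}
LOCATIONS = ["Durango", "Elorrio"]   # stand-in for the sampled locations
SERVICES = ["pharmacy", "school"]    # stand-in for the service categories


def build_pairs() -> list:
    """Instantiate every template over every (location, service) combination."""
    pairs = []
    for field, templates in TEMPLATES.items():
        for template, location, service in product(templates, LOCATIONS, SERVICES):
            pairs.append({
                "question": template.format(service=service, location=location),
                "query": {
                    "select": field,
                    "where": {"location": location, "service": service},
                },
            })
    return pairs


pairs = build_pairs()
print(len(pairs), "pairs; first:", pairs[0]["question"])
```

Because the number of pairs grows as the product of templates, locations, and services, even a modest template set yields a large corpus before any paraphrasing is applied.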

Fine-tuning with QLoRA and Model Evaluation

The DeepSeek R1-Distill-8B model was fine-tuned under strong memory constraints using QLoRA. The approach updates approximately 10M adapter parameters (out of 8B model parameters) while leveraging 4-bit quantization to limit GPU memory footprint. Training employed early stopping to prevent overfitting and used standard metrics (BLEU-4, ROUGE-L, and Exact Match) to track convergence over a held-out validation set.
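To see how an adapter budget on the order of 10M parameters can arise, the arithmetic can be sketched as below. The rank, the set of adapted matrices, and the layer shapes are assumptions chosen for illustration (Llama-style 8B dimensions); the summary does not state the configuration that was actually used.

```python
def lora_param_count(shapes, rank, num_layers):
    """Trainable parameters added by LoRA.

    Each adapted weight of shape (d_in, d_out) gains two low-rank factors,
    A (d_in x rank) and B (rank x d_out), i.e. rank * (d_in + d_out) parameters.
    """
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers


# Assumed Llama-style 8B layer shapes (hypothetical for this example).
HIDDEN, KV, MLP = 4096, 1024, 14336
SHAPES = [
    (HIDDEN, HIDDEN),  # q_proj
    (HIDDEN, KV),      # k_proj
    (HIDDEN, KV),      # v_proj
    (HIDDEN, HIDDEN),  # o_proj
    (HIDDEN, MLP),     # gate_proj
    (HIDDEN, MLP),     # up_proj
    (MLP, HIDDEN),     # down_proj
]

# Assumed rank and depth; not taken from the paper.
trainable = lora_param_count(SHAPES, rank=4, num_layers=32)
fraction = trainable / 8e9
print(f"{trainable:,} trainable parameters ({fraction:.3%} of 8B)")
```

Under these assumptions the adapters amount to roughly a tenth of a percent of the base model, which is why only the small adapter matrices need gradients and optimizer state while the 4-bit base weights stay frozen.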

Figure 2: Evaluation metrics during fine-tuning; loss decreases and stabilizes early, indicative of task structure acquisition and minimal overfitting.

Evaluation on monolingual, multilingual, and unseen-location test sets demonstrates:

High syntactic precision and structural alignment: BLEU-4 and ROUGE-L often near 0.97–0.99.
Strong generalization to new locations (Exact Match 89%) and paraphrased queries (94.2%).
Multilingual performance is uneven: strong in major European languages (e.g., Spanish, Catalan, Galician), but severely degraded for Basque—reflecting foundation model pre-training composition.
Competitive accuracy against SOTA LLMs (GPT-4, Gemini, Grok) despite model scale and hardware constraints.
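For reference, two of the metrics named above can be computed as in the sketch below. It assumes whitespace tokenization, whitespace-insensitive matching, and the F1 variant of ROUGE-L; the paper's exact normalization and metric implementation are not given in this summary, so these choices are illustrative.

```python
def normalize(text: str) -> str:
    """Collapse runs of whitespace so formatting differences do not count as errors."""
    return " ".join(text.split())


def exact_match(predictions, references) -> float:
    """Fraction of predicted queries identical to the reference after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L as the F1 score over the longest common subsequence of tokens."""
    p, r = prediction.split(), reference.split()
    # Dynamic-programming table for the longest common subsequence length.
    dp = [[0] * (len(r) + 1) for _ in range(len(p) + 1)]
    for i, p_tok in enumerate(p, 1):
        for j, r_tok in enumerate(r, 1):
            if p_tok == r_tok:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


preds = ["select walk_minutes  where service = pharmacy", "select distance_m"]
refs = ["select walk_minutes where service = pharmacy", "select walk_minutes"]
print(exact_match(preds, refs), rouge_l_f1(preds[1], refs[1]))
```

Exact Match is the strictest of the three metrics for this task: a query that overlaps heavily with the reference can still score high on BLEU-4 or ROUGE-L while returning the wrong data when executed.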

Application Integration and Performance Analysis

The fine-tuned model is deployed in a web application supporting both free-form and template-based querying. The interface incorporates geolocation (Google Maps) and automatically maps street-level queries to dataset coordinates. Guardrails ensure safety and enforce query validity while adding negligible latency.
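One simple way to enforce query validity before execution is an allow-list check on the generated query. The sketch below assumes a dictionary-shaped query with `select` and `where` keys and hypothetical field names; the application's actual guardrails are not described in detail here.

```python
# Hypothetical allow-lists; a real deployment would derive these from the dataset schema.
ALLOWED_FIELDS = {"walk_minutes", "distance_m"}
ALLOWED_FILTERS = {"location", "service"}


def validate_query(query) -> list:
    """Return a list of problems; an empty list means the query may be executed."""
    if not isinstance(query, dict):
        return ["query must be a dictionary"]
    problems = []
    select = query.get("select")
    if not isinstance(select, str) or select not in ALLOWED_FIELDS:
        problems.append(f"unknown field: {select!r}")
    where = query.get("where")
    if not isinstance(where, dict) or not where:
        problems.append("missing or empty 'where' clause")
    else:
        for key, value in where.items():
            if key not in ALLOWED_FILTERS:
                problems.append(f"unknown filter: {key!r}")
            elif not isinstance(value, str) or not value.strip():
                problems.append(f"filter {key!r} needs a non-empty string value")
    return problems


good = {"select": "walk_minutes", "where": {"location": "Durango", "service": "pharmacy"}}
bad = {"select": "password", "where": {"user": "x"}}
print(validate_query(good), validate_query(bad))
```

Checks of this kind are plain dictionary lookups, which is consistent with the observation that guardrails add little latency compared with model inference.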

Figure 3: Web application for model queries provides a practical interface for accessibility data exploration.

Performance profiling reveals:

Model inference dominates runtime, averaging approximately three seconds per request on an RTX 3090 GPU.
Backend data lookup and logic, including guardrail execution, contribute minimally to overall latency.

Figure 4: Application performance profiling demonstrates that inference, rather than data access or business logic, is the primary bottleneck.
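A per-stage breakdown of this kind can be collected with a small timing helper. The sketch below is an assumption about how such profiling might be done, not the application's actual instrumentation; the stage names and the stub callables (with a sleep standing in for inference) are hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage: str, timings: dict):
    """Accumulate the wall-clock time spent inside the block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start


def handle_request(question, model, guardrail, lookup):
    """Run one request and return the result with per-stage timings in seconds."""
    timings = {}
    with timed("inference", timings):
        query = model(question)
    with timed("guardrails", timings):
        guardrail(query)
    with timed("data_lookup", timings):
        result = lookup(query)
    return result, timings


def slow_model(question):
    time.sleep(0.05)  # stand-in for LLM inference latency
    return {"question": question}


result, timings = handle_request(
    "nearest pharmacy?", slow_model, lambda q: None, lambda q: [6]
)
print(max(timings, key=timings.get), timings)
```

Aggregating these dictionaries over many requests gives the share of latency attributable to each stage.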

The system demonstrates that high-fidelity querying over specialized datasets can be realized with low-cost hardware by focusing on model compactness and task-specific fine-tuning.

The proposed methodology is situated relative to prior work:

Toolformer and Gorilla enable LLMs to interface with diverse external tools or APIs, but require massive models and extensive, often unavailable, annotated corpora, or rely on retrieval-augmented grounding for API selection.
The presented pipeline optimizes for high precision within a narrow domain, and leverages a lightweight, dataset-specific synthetic corpus, enabling efficient tuning without large-scale retrieval, context injection, or multi-tool complexity.
ReAct and similar iterative tool-use paradigms impose additional reasoning and computational overhead for multi-step tool invocation; the presented work circumvents this by restricting itself to single-call, single-dataset workflows with minimal context loading.

The approach thus addresses the complexity, context window, and resource bottlenecks of more general tool-using LLM architectures.

Limitations and Future Directions

Key limitations arise primarily in scalability to multi-dataset or strongly multi-domain deployments:

Manual template and projection enumeration will not scale for exponentially expanding query spaces across combined datasets.
The current application does not address high-concurrency workload scaling; batch inference, aggressive quantization, pruning, and response caching are identified as practical system-level mitigations.
Error recovery mechanisms are rudimentary; future iterations could leverage iterative prompt-error feedback for self-correction.
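The iterative prompt-error feedback idea mentioned above could take a shape like the following. This is a speculative sketch of a possible future mechanism, not something the paper reports implementing: the `model` and `execute` callables, the use of `ValueError` to signal a failed query, and the retry prompt wording are all assumptions.

```python
def generate_with_feedback(question, model, execute, max_attempts=3):
    """Retry query generation, feeding each execution error back into the prompt.

    Returns (result, attempts_used). Raises RuntimeError if every attempt fails.
    """
    prompt = question
    last_error = None
    for attempt in range(1, max_attempts + 1):
        query = model(prompt)
        try:
            return execute(query), attempt
        except ValueError as exc:
            last_error = exc
            # Give the model its own failed query and the error message to correct.
            prompt = (
                f"{question}\n\nPrevious query: {query}\n"
                f"Error: {exc}\nPlease return a corrected query."
            )
    raise RuntimeError(f"no valid query after {max_attempts} attempts") from last_error


def fake_model(prompt):
    # Hypothetical model: only produces a valid query once it has seen an error.
    return "GOOD" if "Error:" in prompt else "BAD"


def fake_execute(query):
    if query != "GOOD":
        raise ValueError(f"cannot execute {query!r}")
    return 42


print(generate_with_feedback("nearest pharmacy?", fake_model, fake_execute))
```

Each retry costs another inference call, so a cap such as `max_attempts` matters in a system where inference already dominates latency.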

Potential generalizations include the use of ontology-based abstraction, LLM-driven topic space enumeration, and more comprehensive multilingual adaptation protocols.

Conclusion

This work establishes that high-accuracy, low-latency natural language querying of structured data is achievable using compact open-source LLMs fine-tuned on commodity hardware. By introducing a principled pipeline for synthetic data generation and fine-tuning, the approach bridges a critical deployment gap between proprietary high-resource LLMs and practical, domain-targeted question-answering systems. Extension to larger-scale, heterogeneous, or dynamic dataset environments remains a promising area for ongoing research, with implications for democratizing access to data-driven analytics and decision support across a wider variety of real-world contexts.

Reference: "Querying Structured Data Through Natural Language Using LLMs" (2604.03057)
