
PV-SQL: Synergizing Database Probing and Rule-based Verification for Text-to-SQL Agents

Published 19 Apr 2026 in cs.AI and cs.DB | (2604.17653v1)

Abstract: Text-to-SQL systems often struggle with deep contextual understanding, particularly for complex queries with subtle requirements. We present PV-SQL, an agentic framework that addresses these failures through two complementary components: Probe and Verify. The Probe component iteratively generates probing queries to retrieve concrete records from the database, resolving ambiguities in value formats, column semantics, and inter-table relationships to build richer contextual understanding. The Verify component employs a rule-based method to extract verifiable conditions and construct an executable checklist, enabling iterative SQL refinement that effectively reduces missing constraints. Experiments on the BIRD benchmarks show that PV-SQL outperforms the best text-to-SQL baseline by 5% in execution accuracy and 20.8% in valid efficiency score while consuming fewer tokens.

Authors (2)

Summary

  • The paper introduces a dual-phase method that combines targeted database probing with rule-based verification to overcome semantic misinterpretation in text-to-SQL tasks.
  • Empirical results on benchmarks like BIRD, Mini-Dev, and Spider reveal significant gains, including a 65.12% execution accuracy and reduced synthesis errors.
  • The PV-SQL framework offers high interpretability and efficiency, making it applicable for real-world scenarios and cost-sensitive deployments.

Introduction

The challenge of mapping natural language (NL) queries to SQL in the context of unseen database schemas persists despite the strong few-shot learning capabilities of current LLMs. The primary failure modes in text-to-SQL agents remain semantic: schema misinterpretation, insufficient value grounding, and incomplete constraint satisfaction. The PV-SQL agentic framework introduces a dual-stage method of explicit database probing and rule-based verification/refinement, targeted at systematically eliminating these weaknesses (Figure 1).

Figure 1: An example of how PV-SQL effectively solves a text-to-SQL task, demonstrating the synergy of content probing and verification.

Architecture and Methodology

PV-SQL’s architecture explicitly decomposes agent operation into two orthogonal but synergistic phases: Probe and Verify. The Probe phase addresses semantic grounding via targeted exploratory queries to the database, intended to resolve context not encoded in schemas—e.g., value representation, functional relationships, and key distributional features. The Verify stage extracts verifiable constraints via deterministic pattern matching from the NL question and iteratively repairs the LLM-generated SQL until all constraints (and database execution) are satisfied (Figure 2).

Figure 2: Overview of PV-SQL, showing the iterative cycle of probing for database-specific evidence and applying rule-based verification for semantic constraint enforcement.
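The Probe phase described above can be sketched minimally as follows. This is an illustrative sketch, not the paper's implementation: it assumes a SQLite backend, and the agent's uncertainty-driven query generation and stopping decision are abstracted into a supplied query list and an explicit cap (`probe` and `max_probes` are hypothetical names).

```python
import sqlite3

def probe(con, probe_queries, max_probes=5):
    """Run a capped set of exploratory queries against an open connection,
    collecting small samples of concrete records as grounding evidence
    for downstream SQL synthesis."""
    evidence = []
    for sql in probe_queries[:max_probes]:  # explicit cap on probing
        try:
            rows = con.execute(sql).fetchmany(5)  # small sample only
            evidence.append({"query": sql, "rows": rows})
        except sqlite3.Error as e:
            # failed probes still carry signal (e.g., a wrong column name)
            evidence.append({"query": sql, "error": str(e)})
    return evidence
```

Accumulating both successful samples and error messages in the evidence list mirrors how probing results are folded into the agent context before SQL generation.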

Probe generation follows an uncertainty-driven, iterative querying scheme with an explicit cap, terminating on the agent's decision. Results are accumulated in the agent context for downstream SQL synthesis. The Verify module leverages a static rule set mapping NL patterns (e.g., “unique”, “top 3”, “average”, “latest”, comparatives) to executable SQL constraints and performs three diagnostic steps: syntax validation, execution-error detection, and clause-presence checking for each constraint. Deficient queries are repaired via targeted LLM calls that retain all prior evidence. This deterministic repair loop is bounded to maximize efficiency and prevent infinite regress.
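The rule-based checklist construction and clause-presence check can be sketched as below. The rule set shown is a small illustrative subset under assumed pattern-to-clause mappings, not the paper's full taxonomy, and `build_checklist`/`check` are hypothetical names.

```python
import re

# Illustrative subset of a static rule set mapping NL patterns to
# required SQL clause fragments (the paper's taxonomy also covers
# "latest", comparatives, etc.).
RULES = [
    (r"\btop\s+(\d+)\b",         lambda m: f"LIMIT {m.group(1)}"),
    (r"\baverage\b",             lambda m: "AVG("),
    (r"\bunique\b|\bdistinct\b", lambda m: "DISTINCT"),
]

def build_checklist(question):
    """Extract verifiable constraints from the NL question by pattern matching."""
    checklist = []
    for pattern, to_clause in RULES:
        m = re.search(pattern, question, re.IGNORECASE)
        if m:
            checklist.append(to_clause(m))
    return checklist

def check(sql, checklist):
    """Clause-presence check: report checklist constraints missing from the SQL."""
    sql_upper = sql.upper()
    return [c for c in checklist if c.upper() not in sql_upper]
```

In the full pipeline, any constraints reported missing by `check` would be handed to a targeted LLM repair call, with the loop bounded to a fixed number of iterations.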

Experimental Results

PV-SQL establishes clear gains across strong baselines (DAIL-SQL, DIN-SQL, MAC-SQL, E-SQL, TA-SQL, XiYan-SQL, TS-SQL) on three public benchmarks (BIRD, Mini-Dev, Spider) and six LLMs (GPT-4o, GPT-4.1-mini, GPT-OSS-20B, Gemma3-4B, Qwen3-4B, Qwen3-0.6B). When using GPT-4o, PV-SQL achieves a 65.12% execution accuracy and 75.55 VES on BIRD—a substantial improvement over the strongest baseline.

Ablation confirms both components' necessity: removing probing reduces EX by 3–4 points and VES by 4–6, while disabling rule-based repair causes similar degradation. Replacing rule-based verification with LLM-based verification adds compute overhead and reduces EX by 6%, demonstrating the superior coverage/efficiency tradeoff of PV-SQL's explicit checklists (Figure 3).

Figure 3: Accuracy vs. token consumption on BIRD. PV-SQL achieves the highest accuracy at moderate compute cost.

Task difficulty breakdown reveals PV-SQL’s largest gains occur on challenging queries (+9% over competitors), capturing more semantic intricacies that break static or self-refinement approaches.

Error Analysis

Underlying performance improvements are validated through systematic error classification. On BIRD, baseline agents are dominated by database misinterpretation (41.3%) and synthesis errors (33.9%), with question ambiguity contributing 24.8%. PV-SQL reduces database misinterpretation errors by 42% and synthesis errors by 19%, while only slightly reducing question ambiguity due to inherent linguistic limitations (Figure 4).

Figure 4: Error distribution with and without PV-SQL: database misinterpretation and synthesis errors are substantially reduced by probe and verify loops.

Component analysis shows 99.39% accuracy in constraint extraction and 90.8% repair success rate on violated constraints with minimal regressions, confirming the reliability of the rule-driven verification-repair pipeline.

Practical and Theoretical Implications

PV-SQL formalizes a general, compositional paradigm for agentic task design: explicit evidence gathering (Probe) and verifiable synthesis (Verify) over rigidly specified feedback channels. This departs from both passive context enrichment (as in static schema linking or demonstration retrieval) and unconstrained self-refinement protocols. Crucially, database probing mitigates schema mismatch, data-value idiosyncrasies, and the out-of-distribution failures typical of real-world deployment. Rule-based verification and iterative repair provide a high-precision semantic filter that LLM-based self-checking cannot match, reducing reliance on fragile emergent self-diagnosis and yielding deterministic, interpretable error localization and correction.

Practically, PV-SQL achieves SOTA accuracy with fewer tokens and lower latency than alternatives, making it deployable in cost-sensitive settings and amenable to compositional extension (e.g., custom domains via additional extraction rules, or integration with code synthesis for mixed tasks). The pattern-centric checklists can scale and adapt, and the probing methodology transfers to any agentic pipeline requiring content grounding beyond abstract schema-level reasoning.

Theoretically, PV-SQL demonstrates that lightweight, explainable verification mechanisms composed with targeted environment feedback (database probes) can outperform black-box, context-only, or LLM-only ensembles, particularly as task complexity and domain gap increase.

Limitations and Future Work

PV-SQL’s constraint extraction and verification are limited by the precision-oriented pattern set; domain- or task-specific constraints outside the current rule taxonomy are deferred to LLM synthesis. Expanding and stratifying constraint coverage remains a target, and automated rule induction from usage corpora may yield further robustness. The approach's component modularity also invites targeted enhancement, for instance replacing the verifier with static-analysis-, type-system-, or contrastive-data-driven alternatives.

Conclusion

PV-SQL advances text-to-SQL agents by combining targeted, agentic database probing with high-precision, rule-based verification and repair. It achieves consistently superior accuracy and reliability across models and benchmarks, with practical efficiency and distinct error reduction mechanisms. The framework crystallizes a general methodology for building robust agent architectures grounded in explicit evidence and verifiable synthesis, serving as a foundation for future AI system design.
