SQL-Centered Agentic Framework

Updated 4 July 2026

SQL-centered agentic frameworks are systems that use SQL as the primary medium for schema grounding, intermediate reasoning, execution feedback, and auditable evidence.
They decompose the text-to-SQL problem into specialized agents handling schema extraction, planning, SQL generation, and output validation through execution-driven control loops.
These frameworks enhance accuracy and cost-efficiency by integrating modular design, execution feedback, and semantic memory to support robust enterprise deployment.

A SQL-centered agentic framework is a class of systems in which SQL is not treated merely as the terminal output of natural-language interfaces, but as the central medium for schema grounding, intermediate reasoning, execution-time verification, and, in some settings, auditable evidence. Recent work uses this paradigm in schema-aware NL2SQL systems, enterprise evaluation pipelines, SQL/Python analytical agents, and neuro-symbolic medical reasoning systems whose conclusions are explicitly tied to executable SQL traces (Cao et al., 5 Jan 2026, Onyango et al., 25 Feb 2026, Pham et al., 8 Apr 2026, Yang et al., 4 Jun 2026). The defining shift is architectural: instead of a single prompt that maps language directly to a query, the problem is decomposed into specialized agents or tool-mediated stages that plan, inspect schema, generate or revise SQL, execute against the database, and validate outputs.

1. Conceptual foundations

The formal objective remains the standard Text-to-SQL problem: given a natural-language query $q$ and schema $S$ , produce SQL $\hat{Y}$ whose execution matches the intended answer. One formulation states the goal as producing $\hat{Y}$ such that $E(\hat{Y}) = E(Y)$ , where $Y$ is the ground-truth SQL, while AV-SQL writes the task more generally as $\mathcal{Y} = f(\mathcal{Q}, \mathcal{S}, \mathcal{K} \mid \boldsymbol{\theta})$ over question, schema, and external knowledge (Onyango et al., 25 Feb 2026, Pham et al., 8 Apr 2026). What distinguishes the SQL-centered agentic variant is that the mapping $f$ is implemented as a controlled interaction loop over SQL-aware tools, intermediate representations, and execution feedback rather than as one-shot sequence generation.

A recurrent theme across the literature is that SQL-centered does not mean SQL-only. In some systems, SQL is the primary interface to relational data and the authoritative substrate for selection, joins, filtering, and coarse aggregation, while Python is introduced downstream for operations that are awkward or brittle in a single query, such as recurrences, custom analytics, or multi-stage transformations (Pham et al., 4 May 2026, Yang et al., 4 Jun 2026). This suggests that “SQL-centered” is best understood as a commitment to SQL as the canonical data-access and grounding layer, not as an exclusivity claim about downstream computation.

A second conceptual distinction concerns the role of execution. In conventional semantic parsing, execution is often a final evaluation step. In SQL-centered agentic frameworks, execution becomes part of the state-transition mechanism: the agent proposes a query, the database returns results or errors, and subsequent reasoning conditions on those observations. This is explicit in execution-gated fallback systems, multi-turn RL agents, and enterprise evaluation frameworks that score both SQL correctness and tool-use behavior (Onyango et al., 25 Feb 2026, Hua et al., 25 Jan 2026, Guo et al., 12 Oct 2025, Serrao et al., 1 Jun 2026).

2. Canonical architectural patterns

A striking feature of the literature is the convergence on modular decomposition. Specialized agents are assigned roles such as schema extraction, plan construction, SQL generation, validation, view synthesis, memory retrieval, or downstream analysis. The modules differ by domain and deployment context, but the architectural grammar is consistent: isolate schema understanding from query synthesis, isolate execution from generation, and make intermediate artifacts inspectable.

System	Specialized stages	Primary emphasis
"An Agentic System for Schema Aware NL2SQL Generation" (Onyango et al., 25 Feb 2026)	Extractor; Decomposer; Generator; Validator and Executor	Schema-aware SLM-first NL2SQL with selective LLM fallback
"OraPlan-SQL" (Liu et al., 27 Oct 2025)	Planner agent; SQL agent	Planning-centric bilingual NL2SQL with execution-result voting
"AGENTIQL" (Heidari et al., 12 Oct 2025)	Reasoning agent; Coding agent; Merge; Column Selection; Adaptive router	Multi-expert divide-and-merge with routing
"AV-SQL" (Pham et al., 8 Apr 2026)	Rewriter; View generator agents; Planner; SQL generator; Revisor	Agentic CTE views for large schemas
"ProSPy" (Yang et al., 4 Jun 2026)	Data profiling; Progressive pruning; Agentic data fetching; Python analysis	Enterprise SQL/Python framework with dialect-agnostic DSL

The schema-aware SLM-first system in particular makes the decomposition explicit: an Extractor Agent builds a schema-aware context from metadata, documentation, and evidence mappings; a Decomposer Agent converts the question into a structured execution plan; a Generator Agent emits SQL with SLM-first, LLM-fallback routing; and a Validator and Executor Agent applies evidence-based value validation, syntax validation, execution validation, and semantic validation before returning results or error diagnostics (Onyango et al., 25 Feb 2026). OraPlan-SQL reduces the architecture to two agents but sharpens the planner/compiler distinction: a Planner agent generates stepwise natural-language plans, and a SQL agent acts as a schema-aware compiler from those plans to executable SQL (Liu et al., 27 Oct 2025). AGENTIQL introduces a related but more explicitly multi-expert divide-and-merge pipeline: reasoning agent, coding agent, merge stage, column-selection refinement, and adaptive router (Heidari et al., 12 Oct 2025).

The pathology framework extends the same pattern outside benchmark NL2SQL. There, global and local Feature Reasoning Agents generate SQL over a multi-scale feature database, a Knowledge Comparison Agent scores candidate diagnoses against measured evidence, and a Report Agent fuses SQL-grounded reasoning with a CNN branch into a final auditable report (Cao et al., 5 Jan 2026). This suggests that SQL-centered agency is not limited to query answering; it can function as an explicit reasoning layer in broader neuro-symbolic systems.

3. SQL as intermediate representation, evidence, and working memory

A central design choice is to promote SQL or SQL-derived structures to first-class intermediate representations. In the schema-aware SLM-first system, the Decomposer Agent outputs a structured plan containing entities, conditions, subqueries, and output specification; this plan functions as a schema-aware intermediate representation that constrains downstream SQL generation (Onyango et al., 25 Feb 2026). OraPlan-SQL uses a stepwise natural-language plan as the intermediate representation, but that plan is already SQL-shaped: it explicitly names filters, joins, groupings, formulas, counterfactual conditions, and entity variants, and the SQL agent is expected to follow it faithfully (Liu et al., 27 Oct 2025).

AV-SQL makes the intermediate representation explicitly executable. Its “agentic views” are agent-generated CTEs that both encode intermediate logic and identify the subset of tables and columns relevant to the query. For each schema chunk, a view generator emits $\mathcal{V}^{j}_{\text{CTE}}$ together with a structured schema selection $\mathcal{J}^{j}$ ; only views that are executable and consistent with their declared table/column usage are retained, and the union of these selections yields a filtered schema for final synthesis (Pham et al., 8 Apr 2026). This turns CTEs into both computation units and schema-pruning evidence.
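The retention rule for agentic views can be illustrated with a minimal sketch. It assumes SQLite as the engine, and `CandidateView`, `is_executable`, `is_consistent`, and `filtered_schema` are hypothetical names, not AV-SQL's API. The consistency check is a crude lexical stand-in for whatever check the paper actually applies.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class CandidateView:
    """Hypothetical container for one agent-generated view."""
    name: str
    cte_sql: str  # SELECT statement used as the CTE body
    declared: dict[str, set[str]] = field(default_factory=dict)  # table -> columns


def schema_of(conn):
    """Read table and column names from the SQLite catalog."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    return {t: {row[1] for row in conn.execute(f'PRAGMA table_info("{t}")')}
            for t in tables}


def is_executable(conn, view):
    """Wrap the body in a CTE; LIMIT 0 still forces the statement to compile."""
    try:
        conn.execute(f"WITH {view.name} AS ({view.cte_sql}) "
                     f"SELECT * FROM {view.name} LIMIT 0")
        return True
    except sqlite3.Error:
        return False


def is_consistent(schema, view):
    """Assumed check: declared tables/columns exist and appear in the SQL text."""
    text = view.cte_sql.lower()
    for table, columns in view.declared.items():
        if table not in schema or table.lower() not in text:
            return False
        if any(c not in schema[table] or c.lower() not in text for c in columns):
            return False
    return True


def filtered_schema(conn, views):
    """Union the declared selections of every view that survives both checks."""
    schema, kept = schema_of(conn), {}
    for view in views:
        if is_executable(conn, view) and is_consistent(schema, view):
            for table, columns in view.declared.items():
                kept.setdefault(table, set()).update(columns)
    return kept


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL);
CREATE TABLE customers (id INTEGER, region TEXT);
""")
views = [
    CandidateView("big_orders",
                  "SELECT id, customer_id FROM orders WHERE total > 100",
                  {"orders": {"id", "customer_id", "total"}}),
    CandidateView("broken", "SELECT missing_col FROM orders", {"orders": {"id"}}),
    CandidateView("regions", "SELECT region FROM customers",
                  {"customers": {"region"}}),
]
pruned = filtered_schema(conn, views)
```

Here the `broken` view is discarded because it does not compile, so its declared selection never reaches the filtered schema.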

In the pathology setting, SQL is elevated further into an explicit trace of evidence. Feature Reasoning Agents first propose a reasoning plan, then translate plan steps into SQL over local and global feature tables. Each query is itself an inference step: FROM encodes analysis scale, WHERE expresses domain constraints, GROUP BY defines comparison structure, and aggregates such as AVG, SUM, COUNT, and STDDEV instantiate measurable evidence. The resulting SQL traces can be rerun by a pathologist to verify how cellular measurements support a diagnostic conclusion (Cao et al., 5 Jan 2026). This use of SQL as an auditable justification layer is one of the clearest formulations of the paradigm.
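A small sketch of SQL as a rerunnable evidence step follows. The table `local_features`, its columns, the threshold, and the helper names are invented for illustration and do not come from the cited system. SQLite is used, so `STDDEV` is omitted because SQLite has no built-in for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE local_features (cell_id INTEGER, region TEXT, nucleus_area REAL);
INSERT INTO local_features VALUES
  (1, 'tumor', 80.0), (2, 'tumor', 100.0), (3, 'stroma', 40.0),
  (4, 'stroma', 50.0), (5, 'tumor', 20.0);
""")

# FROM picks the analysis scale, WHERE the domain constraint,
# GROUP BY the comparison structure, and the aggregates the evidence.
EVIDENCE_SQL = """
SELECT region, COUNT(*) AS n_cells, AVG(nucleus_area) AS mean_area
FROM local_features
WHERE nucleus_area > 30
GROUP BY region
ORDER BY region
"""


def run_step(conn, claim, sql):
    """Execute one inference step and keep the query alongside its result."""
    return {"claim": claim, "sql": sql, "rows": conn.execute(sql).fetchall()}


def replay(conn, step):
    """Re-run the stored SQL and confirm it still yields the recorded rows."""
    return conn.execute(step["sql"]).fetchall() == step["rows"]


step = run_step(conn, "tumor nuclei are larger than stromal nuclei", EVIDENCE_SQL)
```

Because the trace stores the query text rather than only the numbers, a reviewer can call `replay` against the same feature database to check the step.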

A common misconception is that such frameworks require all reasoning to remain inside SQL. The recent SQL/Python hybrids reject that assumption while preserving SQL centrality. FlexSQL can implement a plan in either SQL or Python, using Python for recurrences, complex analytics, JSON handling, or stateful pipelines, then transpiling winning Python programs back to SQL when needed (Pham et al., 4 May 2026). ProSPy similarly confines the database interface to SQL, but intentionally pushes complex downstream analysis into Python over materialized intermediate views, while its dialect-agnostic DSL restricts SQL generation to joins, simple aggregations, and structured conditions (Yang et al., 4 Jun 2026). In both cases, SQL remains the authoritative access layer and working memory boundary between database state and higher-order analysis.
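The division of labour can be sketched as follows, assuming SQLite and an invented `sales` table. SQL performs the coarse aggregation, and Python then applies a recurrence (simple exponential smoothing, chosen here only as an example of a computation that is awkward in a single query). Neither function is taken from FlexSQL or ProSPy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (month TEXT, amount REAL);
INSERT INTO sales VALUES
  ('2024-01', 100), ('2024-01', 100), ('2024-02', 100), ('2024-03', 400);
""")


def fetch_monthly_totals(conn):
    """SQL stage: selection and coarse aggregation stay in the database."""
    sql = "SELECT month, SUM(amount) FROM sales GROUP BY month ORDER BY month"
    return conn.execute(sql).fetchall()


def exponential_smoothing(values, alpha=0.5):
    """Python stage: s_t = alpha * x_t + (1 - alpha) * s_{t-1}, with s_0 = x_0."""
    smoothed = []
    for x in values:
        previous = smoothed[-1] if smoothed else x
        smoothed.append(alpha * x + (1 - alpha) * previous)
    return smoothed


totals = fetch_monthly_totals(conn)
trend = exponential_smoothing([amount for _, amount in totals])
```

The materialized `totals` list plays the role of the intermediate view: everything downstream depends only on what the SQL stage returned.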

4. Execution-driven control loops and learning

The most characteristic operational property of SQL-centered agents is execution-driven control. In the schema-aware SLM-first system, the Validator and Executor Agent acts as the gatekeeper: evidence-based value validation, syntax validation, execution validation, and semantic validation jointly determine whether an SLM-produced SQL query is accepted or escalated to GPT-4o, with up to three regeneration attempts before failure is reported (Onyango et al., 25 Feb 2026). Control is therefore rule- and tool-based rather than classifier-based; the database and validator outputs directly shape routing.
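A minimal sketch of validation-gated escalation is given below. It makes several assumptions: the models are plain callables, the validator performs only a read-only check and an execution check (the evidence-based and semantic checks are omitted), and the first attempt goes to the small model with all later attempts going to the large one. The names `validate` and `answer` are hypothetical.

```python
import sqlite3
from typing import Callable, Optional

# Assumed interface: (question, last_error) -> SQL string.
Generator = Callable[[str, Optional[str]], str]


def validate(conn: sqlite3.Connection, sql: str) -> Optional[str]:
    """Return None when the query passes, otherwise an error message."""
    if not sql.lstrip().lower().startswith(("select", "with")):
        return "only read-only SELECT statements are accepted"
    try:
        conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return str(exc)
    return None


def answer(conn, question, small_model: Generator, large_model: Generator,
           max_attempts: int = 3):
    """Try the small model first; escalate only after a concrete failure signal."""
    error = None
    for attempt in range(max_attempts):
        model, label = ((small_model, "slm") if attempt == 0
                        else (large_model, "llm"))
        sql = model(question, error)   # the last error is fed back as context
        error = validate(conn, sql)
        if error is None:
            return {"status": "ok", "sql": sql, "model": label,
                    "attempts": attempt + 1}
    return {"status": "failed", "error": error, "attempts": max_attempts}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada')")

# Stub models standing in for real SLM and LLM calls.
def stub_small(question, error):
    return "SELECT nme FROM users"      # misspelled column, fails validation


def stub_large(question, error):
    return "SELECT name FROM users"


result = answer(conn, "List user names", stub_small, stub_large)
```

Routing here depends only on validator output, which mirrors the rule- and tool-based control described above; no learned classifier decides when to escalate.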

Multi-turn RL frameworks internalize this loop. SQL-Trail models each episode as a trajectory $\tau = (o_0, a_0, o_1, a_1, \ldots, a_{T-1}, o_T)$, where observations $o_t$ are execution outputs or invalid-action messages and actions $a_t$ are reasoning blocks plus SQL tool calls. Its total reward is explicitly composite, with one term per objective,

$$R(\tau) = R_{\text{correctness}} + R_{\text{efficiency}} + R_{\text{schema}} + R_{\text{structure}} + R_{\text{executability}} + R_{\text{interface}}$$

so correctness, efficiency, schema linking, structural similarity, executability, and interface adherence are jointly optimized (Hua et al., 25 Jan 2026). MTSQL-R1 casts multi-turn conversational Text-to-SQL as an MDP over a small set of discrete actions, coupling database execution with persistent dialogue memory so that coherence across turns is explicitly verified before a query is finalized (Guo et al., 12 Oct 2025).

Execution-driven learning also appears in offline form. ExeSQL treats the database engine as the environment in a bootstrapping loop: candidate SQL for a target dialect is generated and executed, then retained in a pool of successful queries if it runs or placed in a pool of failed queries if it does not, with later DPO training enforcing a preference for executable over failed queries (Zhang et al., 22 May 2025). AGRO-SQL extends the same idea to agentic RL: after a diversity-aware cold start, groups of candidate trajectories are executed, rewarded, and updated via GRPO using group-relative advantages, while a high-fidelity synthesis pipeline attempts to eliminate semantically misaligned “gold” SQL before RL begins (Yang et al., 29 Dec 2025).
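Both offline mechanisms can be sketched briefly. The first function partitions candidates by whether they execute and forms preference pairs in a `prompt`/`chosen`/`rejected` layout (an assumed layout, not ExeSQL's data format). The second computes group-relative advantages as reward minus group mean, divided by the group's population standard deviation, which is one common GRPO normalization and is assumed here rather than taken from AGRO-SQL.

```python
import sqlite3
from itertools import product
from statistics import fmean, pstdev


def executes(conn, sql):
    """True when the target engine accepts and runs the query."""
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False


def build_preference_pairs(conn, candidates):
    """candidates maps each question to a list of candidate SQL strings."""
    pairs = []
    for question, sqls in candidates.items():
        passed = [s for s in sqls if executes(conn, s)]
        failed = [s for s in sqls if s not in passed]
        pairs.extend({"prompt": question, "chosen": good, "rejected": bad}
                     for good, bad in product(passed, failed))
    return pairs


def group_relative_advantages(rewards):
    """Normalize rewards within one sampled group; zeros if all rewards tie."""
    mean, spread = fmean(rewards), pstdev(rewards)
    if spread == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / spread for r in rewards]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
candidates = {"first row": ["SELECT * FROM t LIMIT 1",   # valid in SQLite
                            "SELECT TOP 1 * FROM t"]}    # other-dialect syntax
pairs = build_preference_pairs(conn, candidates)
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Candidates are restricted to read-only statements in this sketch; executing arbitrary generated SQL against a live database would need sandboxing. A query that runs is not necessarily semantically correct, which is why executability alone is a weak reward.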

These systems jointly establish a broad pattern: SQL generation is most robust when the database is treated as an active environment with observable consequences, not merely a final evaluator. The agentic contribution lies less in adding more language-only reasoning than in binding reasoning to executable interactions whose results update the agent's state.

5. Semantic memory and unified evaluation

As SQL-centered agents lengthen their trajectories, two additional problems emerge: repeated exploration and inadequate evaluation. AgentSM addresses the first by storing structured semantic memory of prior trajectories. Rather than retaining raw scratchpads, it stores phase-tagged execution traces in structured Markdown or JSON, indexes them by database and semantic similarity, and retrieves a relevant trajectory before planning a new query. On Spider 2.0, this design reduces average token usage and trajectory length by 25% and 35%, respectively, and reaches 44.8% execution accuracy on Spider 2.0 Lite (Biswal et al., 22 Jan 2026). The underlying thesis is that enterprise SQL agents repeatedly rediscover the same schema facts, join paths, and exploration routines unless those routines are turned into reusable semantic memory.
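The retrieval idea can be sketched with a toy memory store. Token-overlap (Jaccard) similarity stands in for the semantic similarity the paper uses, and `Trajectory`, `SemanticMemory`, and the `min_score` threshold are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    database: str
    question: str
    steps: list[dict]  # phase-tagged, e.g. {"phase": "schema", "sql": "..."}


def jaccard(a: str, b: str) -> float:
    """Crude lexical similarity; a real system would use embeddings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


class SemanticMemory:
    def __init__(self):
        self._by_db: dict[str, list[Trajectory]] = {}

    def add(self, trajectory: Trajectory) -> None:
        """Index stored trajectories by the database they were run against."""
        self._by_db.setdefault(trajectory.database, []).append(trajectory)

    def retrieve(self, database: str, question: str, min_score: float = 0.2):
        """Return the most similar prior trajectory for this database, if any."""
        candidates = self._by_db.get(database, [])
        best = max(candidates, key=lambda t: jaccard(t.question, question),
                   default=None)
        if best is None or jaccard(best.question, question) < min_score:
            return None
        return best


memory = SemanticMemory()
memory.add(Trajectory(
    "sales", "total revenue per region in 2023",
    [{"phase": "schema", "sql": "SELECT name FROM sqlite_master"},
     {"phase": "final", "sql": "SELECT region, SUM(revenue) FROM orders GROUP BY region"}]))

hit = memory.retrieve("sales", "total revenue per region in 2024")
miss = memory.retrieve("hr", "total revenue per region in 2024")
```

Retrieval happens before planning, so the stored schema-exploration steps can be reused instead of rediscovered.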

Evaluation frameworks have evolved in parallel. A common baseline remains execution accuracy,

$$\mathrm{EX} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ E(\hat{Y}_i) = E(Y_i) \right]$$

where $N$ is the number of evaluated questions and $E(Y_i)$ and $E(\hat{Y}_i)$ are the execution results of gold and predicted SQL, respectively (Onyango et al., 25 Feb 2026). BADGER argues that this is insufficient in enterprise settings because aliasing, numeric tolerances, dialect features, and agentic tool behavior all matter. It therefore combines LLM-assisted SQL component extraction, a two-stage Hybrid-EX metric, and an agentic evaluation suite spanning Tool Recall, Tool Order, Excess Tool Usage, faithfulness, G-Eval summary quality, and intent resolution (Serrao et al., 1 Jun 2026). On 150 human-annotated industry queries, Hybrid-EX is assessed by Cohen’s $\kappa$ agreement and reaches 87.3% balanced accuracy, outperforming all six competing frameworks evaluated in that study (Serrao et al., 1 Jun 2026).
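Plain execution accuracy, as used for the baseline above, can be computed with a short sketch. It assumes SQLite, compares result sets as multisets so that row order is ignored, and counts a prediction that fails to execute as a miss. These comparison choices are assumptions, and they are exactly the kind of detail that stricter metrics such as Hybrid-EX refine.

```python
import sqlite3
from collections import Counter


def execution_result(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """pairs is a list of (gold_sql, predicted_sql) tuples."""
    if not pairs:
        return 0.0
    hits = 0
    for gold_sql, predicted_sql in pairs:
        gold = execution_result(conn, gold_sql)
        predicted = execution_result(conn, predicted_sql)
        hits += int(gold is not None and gold == predicted)
    return hits / len(pairs)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (x INTEGER);
INSERT INTO t VALUES (1), (2), (3);
""")
pairs = [
    ("SELECT x FROM t WHERE x > 1", "SELECT x FROM t WHERE x >= 2"),   # match
    ("SELECT COUNT(*) FROM t", "SELECT COUNT(x) FROM t WHERE x > 5"),  # mismatch
    ("SELECT x FROM t", "SELECT y FROM t"),                            # error
]
accuracy = execution_accuracy(conn, pairs)
```

The first pair shows why execution match is more forgiving than string match: the two queries differ textually but return the same rows.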

A plausible implication is that SQL-centered agentic frameworks require dual evaluation layers. One layer measures executable correctness and efficiency of the SQL substrate; the other measures whether the agent’s broader behavior (tool calling, summarization, clarification, and decomposition) remains aligned with expert expectations. BADGER makes this layering explicit, and AgentSM shows that evaluation artifacts can themselves become memory objects that improve future reasoning.

6. Cost, scheduling, and enterprise deployment

A major rationale for SQL-centered agency is operational, not only algorithmic. The schema-aware SLM-first system reports 47.78% execution accuracy and 51.05% VES on the BIRD benchmark while achieving over 90% cost reduction relative to LLM-centric baselines: approximately 67% of queries are handled by local SLMs, the average cost per query falls well below the \$0.094 reported for LLM-only systems, and locally executed queries incur near-zero operational cost (Onyango et al., 25 Feb 2026). The key mechanism is validation-gated escalation: expensive models are invoked only after concrete failure signals.

The Datalake Agent addresses a related cost problem in multi-database settings where the relevant schema is not known a priori. Rather than passing all schema metadata in one prompt, it exposes GetDBDescription, GetTables, GetColumns, and DBQueryFinalSQL as tools in an interactive loop. On 23 databases and 319 tables, average token usage remains around 4,264 tokens per task, while the direct baseline rises to 34,602; overall token usage is reduced by up to 87%, and the direct solver can become up to 8x more expensive at large scale (Jehle et al., 16 Oct 2025). This makes metadata navigation itself an agentic subproblem.
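The tool surface can be sketched as follows. The four tool names come from the text, but their signatures, the SQLite backing, and the scripted `find_table` policy (standing in for an LLM that decides which tool to call next) are assumptions made for illustration.

```python
import sqlite3


class DatalakeTools:
    """Metadata tools exposed one call at a time instead of one large prompt."""

    def __init__(self, databases, descriptions):
        self._dbs = databases              # name -> sqlite3.Connection
        self._descriptions = descriptions  # name -> short description
        self.calls = []                    # log of tool invocations

    def _tables(self, db):
        rows = self._dbs[db].execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
        return [r[0] for r in rows]

    def GetDBDescription(self):
        self.calls.append("GetDBDescription")
        return dict(self._descriptions)

    def GetTables(self, db):
        self.calls.append(f"GetTables({db})")
        return self._tables(db)

    def GetColumns(self, db, table):
        self.calls.append(f"GetColumns({db}, {table})")
        if table not in self._tables(db):   # reject unknown table names
            raise KeyError(table)
        rows = self._dbs[db].execute(f'PRAGMA table_info("{table}")')
        return [r[1] for r in rows]

    def DBQueryFinalSQL(self, db, sql):
        self.calls.append(f"DBQueryFinalSQL({db})")
        return self._dbs[db].execute(sql).fetchall()


def find_table(tools, column):
    """Scripted policy: walk the metadata until some table exposes `column`."""
    for db in tools.GetDBDescription():
        for table in tools.GetTables(db):
            if column in tools.GetColumns(db, table):
                return db, table
    return None


crm, billing = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
billing.executescript("""
CREATE TABLE invoices (id INTEGER, amount REAL);
INSERT INTO invoices VALUES (1, 10.0), (2, 15.5);
""")
tools = DatalakeTools({"crm": crm, "billing": billing},
                      {"crm": "customer records", "billing": "invoices and payments"})

db, table = find_table(tools, "amount")
total = tools.DBQueryFinalSQL(db, f"SELECT SUM(amount) FROM {table}")
```

Only the metadata actually requested enters the agent's context, which is the source of the token savings; the `calls` log makes the exploration path inspectable.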

At serving time, HEXGEN-TEXT2SQL treats an agentic SQL workflow as a schedulable DAG of dependent LLM calls running on heterogeneous GPU clusters. Its global coordinator dispatches tasks using a workload-balanced score, while local model instances prioritize requests by urgency relative to per-query SLO budgets. On realistic CHESS-style workflows over BIRD traces, it reduces latency deadlines and improves throughput relative to vLLM, both at peak and on average (Peng et al., 8 May 2025). This operationalizes a point implicit in many papers: once SQL reasoning becomes genuinely agentic, orchestration and scheduling become part of the framework rather than external deployment details.
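Urgency-based local prioritization can be illustrated with a least-slack-first queue. This is a generic scheduling heuristic used here as a stand-in, not HEXGEN-TEXT2SQL's actual scoring rule; `UrgencyQueue`, the slack formula, and the time units are assumptions.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    slack: float                          # only field used for ordering
    query_id: str = field(compare=False)


class UrgencyQueue:
    """Serve the request with the least remaining slack against its SLO first."""

    def __init__(self):
        self._heap = []

    def push(self, query_id, deadline, now, estimated_remaining):
        # Slack = time left before the deadline minus work still to be done.
        slack = deadline - now - estimated_remaining
        heapq.heappush(self._heap, Request(slack, query_id))

    def pop(self):
        return heapq.heappop(self._heap).query_id

    def __len__(self):
        return len(self._heap)


queue = UrgencyQueue()
queue.push("q1", deadline=10.0, now=0.0, estimated_remaining=2.0)  # slack 8
queue.push("q2", deadline=5.0, now=0.0, estimated_remaining=4.0)   # slack 1
queue.push("q3", deadline=6.0, now=0.0, estimated_remaining=1.0)   # slack 5
order = [queue.pop() for _ in range(len(queue))]
```

In a multi-stage workflow the estimated remaining work would cover all downstream LLM calls of the query's DAG, so an early-stage request with a tight budget can outrank a late-stage request with a loose one.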

Enterprise-focused systems reinforce the same trend. BADGER is designed to run entirely inside a governed client environment, while ProSPy achieves 60.15% on Spider 2.0-Lite and 60.51% on Spider 2.0-Snow with Claude-4.5-Opus without majority voting, emphasizing robustness to dialect variation and a favorable recall–precision trade-off in schema pruning (Serrao et al., 1 Jun 2026, Yang et al., 4 Jun 2026). The literature therefore increasingly treats deployability, governance, and latency as first-class design objectives alongside execution accuracy.

7. Limitations and open directions

Despite rapid progress, the current generation of SQL-centered agentic frameworks retains several recurring failure modes. The schema-aware SLM-first system reports performance degradation on nested subqueries, joins, and complex temporal reasoning, along with cascading error propagation when upstream decomposition is wrong and significant tail latency when LLM fallback is triggered repeatedly (Onyango et al., 25 Feb 2026). SQL-Trail notes that multi-turn interaction raises inference cost and requires interactive database access, while reward shaping can bias the policy toward gold-like surface forms rather than true semantic equivalence (Hua et al., 25 Jan 2026). ExeSQL’s binary execution reward has the complementary weakness that semantically wrong but executable SQL can still receive positive reinforcement during training (Zhang et al., 22 May 2025).

Memory and evaluation frameworks also expose unresolved issues. AgentSM improves schema exploration reuse, but the paper indicates that it helps more with repetitive database-specific patterns than with deep mathematical or logical reasoning, and that nested Snowflake schemas remain especially difficult (Biswal et al., 22 Jan 2026). BADGER reports that its annotated evaluation corpus is still limited, that pre-captured result tables cannot detect future schema drift, and that standard multi-turn conversational and multimodal RAG evaluation remain incomplete (Serrao et al., 1 Jun 2026). ProSPy acknowledges that errors in intermediate SQL views propagate into Python analysis, and that schema linking dominates runtime, accounting for 70–85% of online time in its measurements (Yang et al., 4 Jun 2026).

The research frontier is therefore shifting from basic decomposition toward more selective and better-calibrated agency. Concrete directions already identified in the literature include complexity-aware routing, privacy-preserving fallback with masking of sensitive content, DAG-based tool-order evaluation, multimodal validation, live execution rather than pre-captured tables, richer semantic and cost-aware rewards, and broader domain coverage across manufacturing, logistics, and life sciences (Onyango et al., 25 Feb 2026, Serrao et al., 1 Jun 2026, Yang et al., 4 Jun 2026). A plausible implication is that future SQL-centered agentic frameworks will be judged less by whether they are agentic at all than by how precisely they control when to explore, when to execute, when to escalate, and when to stop.