MTSQL-R1: Agentic Text-to-SQL Framework
- The paper presents an agentic training framework that employs an MDP to iteratively propose, execute, verify, and refine SQL queries in multi-turn dialogues.
- It integrates execution feedback and persistent dialogue memory to ensure query validity and cross-turn coherence by checking both syntax and context.
- Empirical results on CoSQL and SParC benchmarks demonstrate significant improvements in execution accuracy and exact match over traditional one-shot methods.
MTSQL-R1 is an agentic training framework for long-horizon multi-turn Text-to-SQL generation, designed to enable models to translate conversational user utterances into executable SQL queries while ensuring coherence across dialogue turns and effective grounding to the underlying database schema. Distinct from canonical approaches—where each turn is treated as an independent translation—the MTSQL-R1 system views query formulation as a sequential decision process, employing iterative proposal, execution, verification, and refinement cycles. This design directly addresses critical issues in conversational semantic parsing, such as non-executable queries and cross-turn incoherence. MTSQL-R1 leverages explicit interaction with both the target database and a persistent dialogue memory, formalizing its internal reasoning steps as a Markov Decision Process (MDP).
1. Agentic Formulation and Markov Decision Process (MDP)
MTSQL-R1 formalizes multi-turn Text-to-SQL as a Markov Decision Process, enabling the agent to engage in interactive, environment-driven reasoning. The state at each step is a tuple of six components: the dialogue history, the schema, the current utterance, the long-term dialogue memory, the intermediate SQL query, and the accumulated observations (e.g., execution feedback). The agent's action space consists of:
- PROPOSE (formulate a candidate SQL query),
- EXECUTE (run the query against the database),
- E-VERIFY (validate against execution feedback),
- M-VERIFY (dialogue memory coherence check),
- SELF-CORRECT (refine the query upon failure),
- FINALIZE (commit the query if checks pass).
The transitions among these actions explicitly model both forward progression (from proposal through execution and verification to finalization) and iterative backtracking via verification and correction.
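The state components and action set listed above can be sketched as plain Python types. This is only an illustration: the names `State`, `Action`, `ALLOWED_NEXT`, and `is_valid_trajectory` are hypothetical, and the transition table in particular is an assumption inferred from the prose description, not the paper's formal definition.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional


class Action(Enum):
    PROPOSE = auto()
    EXECUTE = auto()
    E_VERIFY = auto()
    M_VERIFY = auto()
    SELF_CORRECT = auto()
    FINALIZE = auto()


@dataclass
class State:
    history: List[str]                                      # dialogue history
    schema: str                                             # database schema
    utterance: str                                          # current user utterance
    memory: List[str] = field(default_factory=list)         # long-term dialogue memory
    sql: Optional[str] = None                               # intermediate SQL query
    observations: List[str] = field(default_factory=list)   # accumulated feedback


# Assumed transition table (illustrative only): which actions may follow which.
ALLOWED_NEXT = {
    Action.PROPOSE: {Action.EXECUTE},
    Action.EXECUTE: {Action.E_VERIFY},
    Action.E_VERIFY: {Action.M_VERIFY, Action.SELF_CORRECT},
    Action.M_VERIFY: {Action.FINALIZE, Action.SELF_CORRECT},
    Action.SELF_CORRECT: {Action.EXECUTE},
    Action.FINALIZE: set(),
}


def is_valid_trajectory(actions):
    """Return True if every consecutive pair of actions respects ALLOWED_NEXT."""
    return all(b in ALLOWED_NEXT[a] for a, b in zip(actions, actions[1:]))
```

Under this assumed table, a trajectory that fails execution verification once, corrects itself, and then finalizes is valid, whereas jumping from PROPOSE straight to FINALIZE is not.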
2. Execution Feedback and Environment-Driven Verification
A central component of MTSQL-R1 is environment-driven verification, realized by executing the candidate SQL query against a live relational database. Execution feedback includes syntactic validity (does the query parse?), semantic validity (does the query return expected results or meaningful rows?), and detailed error messages. Such feedback is incorporated as observations, enabling the agent to localize errors (whether in SQL formulation or schema referencing) and to iteratively adapt subsequent proposals. This external grounding is critical for robustness, significantly reducing the frequency of non-executable queries and enhancing practical deployability for data-centric dialogue systems.
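A minimal sketch of how execution feedback might be turned into an observation, using Python's standard `sqlite3` module and an in-memory database. The helper `execute_with_feedback`, the observation keys, and the `singer` table are assumptions made for illustration; the source does not specify the database engine or the observation format.

```python
import sqlite3


def execute_with_feedback(conn, sql):
    """Run sql and return an observation dict describing the outcome."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        # Syntax errors and unknown tables or columns both surface here.
        return {"ok": False, "rows": None, "empty": None, "error": str(exc)}
    return {"ok": True, "rows": rows, "empty": len(rows) == 0, "error": None}


# Hypothetical toy database used only for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?, ?)",
    [(1, "Ann", 30), (2, "Bo", 45)],
)

good = execute_with_feedback(conn, "SELECT name FROM singer WHERE age > 40")
bad = execute_with_feedback(conn, "SELECT nam FROM singer")  # misspelled column
```

The error string in `bad["error"]` names the offending column, which is the kind of signal an agent could use to localize a schema-referencing mistake.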
3. Dialogue Memory and Coherence Verification
In addition to execution feedback, MTSQL-R1 maintains a persistent dialogue memory. This memory tracks previous user utterances, generated SQL queries, and associated constraints (such as entity references, prior logical forms, and conversational context). During M-VERIFY steps, the agent checks whether the newly generated SQL query preserves cross-turn constraints and maintains conversation coherence. This mechanism prevents typical pitfalls, such as forgetting prior selections or violating contextually established relations, and maintains semantic alignment throughout the dialogue.
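The memory and its coherence check can be illustrated with a deliberately crude sketch. Here constraints are stored as SQL fragments and the check is a normalized substring test; this is an assumption chosen for brevity, since the source describes M-VERIFY only at a high level and does not say it works this way. `DialogueMemory` and `memory_verify` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DialogueMemory:
    utterances: List[str] = field(default_factory=list)
    queries: List[str] = field(default_factory=list)
    constraints: List[str] = field(default_factory=list)  # e.g. "age > 40"

    def record(self, utterance, sql, new_constraints=()):
        """Store one finished turn and any constraints it established."""
        self.utterances.append(utterance)
        self.queries.append(sql)
        self.constraints.extend(new_constraints)


def _norm(text):
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def memory_verify(memory, sql):
    """Return the carried-over constraints that the candidate SQL drops."""
    candidate = _norm(sql)
    return [c for c in memory.constraints if _norm(c) not in candidate]


memory = DialogueMemory()
memory.record(
    "Show singers older than 40",
    "SELECT name FROM singer WHERE age > 40",
    ["age > 40"],
)

# Follow-up turn: "Only those from France."
missing = memory_verify(memory, "SELECT name FROM singer WHERE country = 'France'")
kept = memory_verify(
    memory, "SELECT name FROM singer WHERE age > 40 AND country = 'France'"
)
```

A substring test cannot recognize semantically equivalent rewrites (for example `age >= 41`), so a realistic check would need to reason over the query rather than its text.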
4. Iterative Reasoning: Propose → Execute → Verify → Refine
MTSQL-R1’s decision protocol explicitly supports iterative refinement. The agent first proposes a SQL candidate, executes it, analyzes feedback, and either finalizes the output or corrects errors by re-entering the cycle. The self-correction mechanism actively produces refined SQL in response to failed verification, utilizing both execution errors and memory mismatches as guidance. The process continues until both the environment check (database feedback) and the memory check (coherence) pass, so that a finalized query has been verified for both executability and cross-turn consistency.
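The cycle described above can be sketched as a small loop. Everything here is a stand-in: `run_turn` is a hypothetical function, `propose` and `correct` are stubs for the model, the memory check is a simple substring test, and the `max_steps` budget is an added assumption so the loop always terminates.

```python
import sqlite3


def run_turn(conn, propose, correct, required, max_steps=5):
    """Propose, execute, verify, and refine until both checks pass."""
    sql = propose()                                   # PROPOSE
    for _ in range(max_steps):
        try:
            conn.execute(sql).fetchall()              # EXECUTE
            error = None
        except sqlite3.Error as exc:
            error = str(exc)
        # E-VERIFY: did execution succeed? M-VERIFY: are prior constraints kept?
        missing = [c for c in required if c.lower() not in sql.lower()]
        if error is None and not missing:
            return sql                                # FINALIZE
        sql = correct(sql, error, missing)            # SELF-CORRECT
    return None  # step budget exhausted without a verified query


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER, country TEXT)")

# Scripted corrections standing in for a model's refinements.
attempts = iter([
    "SELECT name FROM singer WHERE country = 'France'",               # runs, drops constraint
    "SELECT name FROM singer WHERE age > 40 AND country = 'France'",  # passes both checks
])

final = run_turn(
    conn,
    propose=lambda: "SELECT nam FROM singer WHERE country = 'France'",  # bad column
    correct=lambda sql, error, missing: next(attempts),
    required=["age > 40"],
)
```

In this run the first candidate fails execution, the second executes but drops the earlier `age > 40` constraint, and the third passes both checks and is returned.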
5. Empirical Evaluation on Conversational Benchmarks
MTSQL-R1 is empirically validated on conversational semantic parsing benchmarks—CoSQL and SParC. Performance metrics include Execution Accuracy (EX), which evaluates whether generated SQL both parses and produces correct results, and Exact Match (EM), which checks for string-level correspondence with reference SQL. MTSQL-R1 exhibits consistent improvements over strong baselines, including prompting-based and fine-tuned models, even when instantiated with modestly sized open-source LLMs (1.7B–4B parameters). The framework’s explicit iterative verification mechanism is pivotal in achieving these gains, with reductions in incoherent and non-executable outputs across dialogue turns.
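To make the difference between the two metrics concrete, here is a simplified sketch that follows the descriptions above: EM as a normalized string comparison and EX as a comparison of returned rows. The functions `exact_match` and `execution_match` are hypothetical, and the official benchmark evaluation scripts may differ (for instance by comparing parsed SQL components rather than raw strings).

```python
import sqlite3
from collections import Counter


def _norm(sql):
    """Lowercase, drop a trailing semicolon, and collapse whitespace."""
    return " ".join(sql.lower().rstrip(";").split())


def exact_match(pred, gold):
    """String-level match after light normalization."""
    return _norm(pred) == _norm(gold)


def execution_match(conn, pred, gold):
    """True if pred runs and returns the same multiset of rows as gold."""
    try:
        pred_rows = conn.execute(pred).fetchall()
    except sqlite3.Error:
        return False  # a non-executable prediction can never match
    gold_rows = conn.execute(gold).fetchall()
    return Counter(pred_rows) == Counter(gold_rows)


# Hypothetical toy database used only for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 30), ("Bo", 45)])

gold = "SELECT name FROM singer WHERE age > 40"
pred = "SELECT name FROM singer WHERE age >= 41"

em = exact_match(pred, gold)            # differs as text
ex = execution_match(conn, pred, gold)  # same rows on this database
```

The prediction here fails the string comparison but returns the same rows as the reference, which is why the two metrics are reported separately.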
6. Contribution to Community Research and Open Recipes
A significant aspect of MTSQL-R1 is its comprehensive resource release, including codebase, trained models, detailed logs, and reasoning trajectories. These artifacts enable reproducibility and encourage further exploration of agentic reasoning, persistent memory integration, and iterative verification in multi-turn semantic parsing. The release of full "recipes" provides granular insight into environment-interactive learning processes, supporting transparent progress in conversational Text-to-SQL research.
7. Significance and Technical Impact
MTSQL-R1 redefines long-horizon multi-turn Text-to-SQL by shifting from short-horizon, one-shot translation paradigms to interactive, agentic training. Its MDP formulation, together with environment-driven feedback and memory-guided refinement, yields SQL queries that are both executable and contextually coherent over extended dialogues. This structure addresses critical limitations in prior semantic parsing systems and sets a foundation for future research in interactive natural language interfaces to relational databases, especially in scenarios demanding robust multi-turn conversational grounding (Guo et al., 12 Oct 2025).