
ReFoRCE Agent: Advanced Text-to-SQL

Updated 24 November 2025
  • ReFoRCE Agent is a Text-to-SQL system that uses schema compression and LLM-guided linking to manage large, heterogeneous enterprise databases.
  • It leverages iterative self-refinement, error feedback, and CTE decomposition to generate accurate multi-dialect SQL queries.
  • Consensus enforcement through parallel candidate voting and iterative column exploration ensures robust performance on benchmarks like Spider 2.0.

ReFoRCE is a Text-to-SQL agent designed to address the challenges of deploying natural language interfaces over large, heterogeneous, enterprise-grade databases, particularly those characterized by schema sizes exceeding 1,000 columns, a diversity of SQL dialects (such as BigQuery and Snowflake), and complex analytical query requirements. Distinct from prior approaches, ReFoRCE incorporates aggressive schema information compression, LLM-guided schema linking, iterative self-refinement, parallel consensus voting, and iterative column exploration. These mechanisms result in state-of-the-art performance on the Spider 2.0 benchmark, including its challenging “Snow” and “Lite” subsets (Deng et al., 2 Feb 2025).

1. Database Information Compression and Schema Linking

ReFoRCE implements a multi-stage pipeline to reduce input context length and enhance LLM relevance for large schemas. The first stage groups tables by common prefixes or suffixes using a string-prefix similarity:

\mathrm{sim}_{\mathrm{prefix}}(t_i, t_j) = |\mathrm{LongestCommonPrefix}(\mathrm{name}(t_i), \mathrm{name}(t_j))|

Tables are merged if their prefix (or suffix) similarity exceeds a threshold $\tau$, with only the lexicographically first table’s DDL (Data Definition Language) retained for each group. For example, the collection $\mathcal{T} = \{t_1,\dots,t_N\}$ is partitioned into groups $\mathcal{C} = \{C_1,\dots,C_M\}$, yielding the compressed representation

\mathrm{compress}_1(\mathcal{T}) = \{ \mathrm{DDL}(r_{C_k}) : k = 1 \dots M \} \cup \{ \mathrm{name}(t) : t \notin \cup_k C_k \}
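
As an illustration, the following is a minimal Python sketch of this grouping step, assuming a hypothetical `tables` dict mapping table names to DDL strings; the character threshold `tau` and the greedy merge over sorted names are illustrative choices rather than details taken from the paper.

```python
# Minimal sketch of prefix-based table grouping (assumption: tables is a dict
# from table name to DDL string; tau and the greedy merge are illustrative).
import os

def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two table names."""
    return len(os.path.commonprefix([a, b]))

def compress_schema(tables: dict[str, str], tau: int = 8):
    """Group tables whose names share a prefix of at least `tau` characters and
    keep the full DDL only for the lexicographically first table in each group."""
    groups: list[list[str]] = []
    for name in sorted(tables):
        if groups and common_prefix_len(groups[-1][0], name) >= tau:
            groups[-1].append(name)            # merge into the current group
        else:
            groups.append([name])              # start a new group
    kept_ddls = {g[0]: tables[g[0]] for g in groups}   # one representative DDL per group
    name_only = [n for g in groups for n in g[1:]]     # remaining tables listed by name only
    return kept_ddls, name_only

# Example: event_2021, event_2022, event_2023 collapse into one DDL plus two bare names.
```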

Empirically, the compressed token length $L_c$ satisfies $L_c/L_0 \ll 1$ (typically 0.1–0.5 for very large databases). LLM-guided schema linking then selects only contextually relevant tables for the query $Q$ via a lightweight LLM session, asking, “Which of these tables are likely to be needed to answer $Q$?” Only the DDLs for the top $K$ tables, as identified by the LLM, are surfaced in detail, and the initial prompt is

P_{\mathrm{init}} = \mathrm{compress}(\mathcal{D}) + \mathcal{E} + Q

where $\mathcal{D}$ is the database schema and $\mathcal{E}$ denotes supplementary context.
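
A sketch of the linking and prompt-assembly step might look as follows; `ask_llm` is a hypothetical single-turn LLM call and the prompt wording is illustrative rather than the exact ReFoRCE prompt.

```python
# Sketch of LLM-guided schema linking and assembly of the initial prompt P_init.
# `ask_llm` is a hypothetical LLM wrapper; the prompt text is illustrative only.
def link_schema(ask_llm, question: str, kept_ddls: dict[str, str],
                name_only: list[str], top_k: int = 10) -> list[str]:
    """Ask a lightweight LLM session which tables are likely needed for the question."""
    table_listing = "\n".join(list(kept_ddls) + name_only)
    reply = ask_llm(
        f"Question: {question}\n"
        f"Tables:\n{table_listing}\n"
        f"Which of these tables are likely to be needed to answer the question? "
        f"Return at most {top_k} table names, one per line."
    )
    return [line.strip() for line in reply.splitlines() if line.strip()][:top_k]

def build_initial_prompt(question: str, kept_ddls: dict[str, str],
                         linked_tables: list[str], extra_context: str = "") -> str:
    """P_init = compressed schema (detailed DDL only for linked tables) + context + question."""
    detailed_ddls = "\n\n".join(kept_ddls[t] for t in linked_tables if t in kept_ddls)
    return f"{detailed_ddls}\n\n{extra_context}\n\nQuestion: {question}"
```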

2. Self-Refinement: Iterative Correction and CTE Decomposition

ReFoRCE integrates an error-driven, dialect-agnostic self-refinement loop for SQL generation. The routine executes as follows:

  1. Initialize $k \leftarrow 0$ and history $H \leftarrow [\,]$.
  2. Generate SQL candidate $S^{(k)}$ via $S^{(k)} = \mathrm{LLM}(P_{\mathrm{init}}, F, H)$, where $F$ is the expected output format.
  3. Execute $S^{(k)}$ in the target engine (e.g., BigQuery) to obtain result $R^{(k)}$.
  4. If $R^{(k)}$ is a valid, non-empty result matching $F$ and the same result already appears in the history (logical equivalence across iterations), output $S^{(k)}$; otherwise, continue.
  5. Errors $\delta^{(k)}$ identified during execution (syntax, type, etc.) are appended to $H$, and $k$ is incremented.
  6. Terminate after $K_{\max}$ iterations or after three consecutive failures.

The update is formalized as

\begin{aligned}
e^{(k)} &= \mathrm{exec}(S^{(k)}, \mathcal{I}), \\
\delta^{(k)} &= \mathrm{parse\_error}(e^{(k)}), \\
S^{(k+1)} &= \mathrm{LLM}(P_{\mathrm{init}}, F, (S^{(0)}, \delta^{(0)}, \dots, S^{(k)}, \delta^{(k)}))
\end{aligned}

A CTE-based fallback decomposes SQL into named Common Table Expressions upon repeated failures, localizing error correction to fragments with mismatched or unexpected intermediate outputs.
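
The loop can be sketched as follows; `generate_sql`, `execute_sql`, and `parse_error` are hypothetical stand-ins for the LLM call, the database client, and error extraction, and the stopping rules mirror the description above, with the CTE-based fallback left as a comment.

```python
# Sketch of the history-augmented self-refinement loop (hypothetical helpers).
def self_refine(generate_sql, execute_sql, parse_error,
                p_init: str, fmt: str, k_max: int = 8):
    history = []                       # list of (sql, feedback) pairs fed back to the LLM
    seen_results = {}                  # result fingerprint -> number of times observed
    consecutive_failures = 0
    for _ in range(k_max):
        sql = generate_sql(p_init, fmt, history)
        result, error = execute_sql(sql)           # error is None on success
        if error is None and result:               # valid, non-empty result
            fingerprint = str(result)
            seen_results[fingerprint] = seen_results.get(fingerprint, 0) + 1
            if seen_results[fingerprint] >= 2:     # same result reached twice: accept
                return sql
            history.append((sql, f"Result preview: {fingerprint[:500]}"))
            consecutive_failures = 0
        else:
            feedback = parse_error(error) if error is not None else "Query returned no rows."
            history.append((sql, feedback))
            consecutive_failures += 1
            if consecutive_failures >= 3:          # give up; a CTE-based fallback would run here
                break
    return None                                    # no accepted candidate within K_max iterations
```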

3. Consensus Enforcement and Candidate Aggregation

ReFoRCE employs a parallel workflow for robustness via consensus. $T$ independent threads (with $T = 3$) execute the complete pipeline (compression, exploration, refinement) with stochastic sampling, producing $\{R^{(1)}, R^{(2)}, \dots, R^{(T)}\}$. The majority candidate $R^*$ is selected by

R^* = \arg\max_q \phi(q)

where $\phi(q)$ counts the supporting threads. Ties trigger a fallback LLM prompt for plausibility ranking. Alternatively, a soft-voting procedure integrates model-assigned scores $s_j$ for each candidate $Q_j$:

\mathrm{score}(q) = \sum_{j: Q_j = q} s_j, \qquad q^* = \arg\max_q \mathrm{score}(q)

This ensemble process increases robustness against spurious LLM generations and aids in disambiguating ambiguous queries.
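
A compact sketch of the aggregation step is given below, using soft voting as the tie-breaker in place of the LLM plausibility prompt; the result fingerprints and scores are assumed to come from the upstream threads.

```python
# Sketch of candidate aggregation across T parallel threads: hard majority voting
# with a soft-voting fallback (summing model-assigned scores) on ties.
from collections import Counter, defaultdict

def consensus(results: list[tuple[str, float]]) -> str:
    """results: one (result_fingerprint, score) pair per thread."""
    counts = Counter(fingerprint for fingerprint, _ in results)
    top_fingerprint, top_count = counts.most_common(1)[0]
    if sum(1 for c in counts.values() if c == top_count) == 1:
        return top_fingerprint                     # unique majority winner
    scores: defaultdict[str, float] = defaultdict(float)
    for fingerprint, score in results:
        scores[fingerprint] += score               # soft voting over tied candidates
    return max(scores, key=scores.get)

# Example: consensus([("A", 0.7), ("B", 0.9), ("A", 0.6)]) returns "A".
```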

4. Iterative Column Exploration and Feedback

Column exploration precedes final SQL synthesis to discover relevant features:

  1. An auxiliary LLM session, primed with $P_{\mathrm{init}}$ and an exploration prompt, generates up to 10 candidate table.column references accompanied by simple selection queries (SELECT DISTINCT, limited to 100 rows, no CTEs).
  2. Each candidate query $S^{\mathrm{exp}}_i$ is executed; errors invoke a corrective sub-loop (at most three retries per failed SQL).
  3. On five consecutive failures, the exploration phase aborts.
  4. Successfully retrieved sample values $R^{\mathrm{exp}}$ are joined into the main prompt context for the self-refinement phase.

The correction algorithm retries failed queries by providing failure feedback (“Your query failed with error $E$. Please correct it.”) to the LLM, replacing the original statement until execution succeeds or the retry limit is reached.
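
The exploration-and-correction loop can be sketched as follows; `propose_exploration_sqls`, `fix_sql`, and `run_sql` are hypothetical helpers, and the numeric limits follow the description above.

```python
# Sketch of iterative column exploration with execution feedback (hypothetical helpers).
def explore_columns(propose_exploration_sqls, fix_sql, run_sql, p_init: str) -> dict:
    samples = {}                                   # exploration SQL -> retrieved sample rows
    consecutive_failures = 0
    for sql in propose_exploration_sqls(p_init)[:10]:          # up to 10 candidate queries
        rows, error = run_sql(sql)                              # error is None on success
        retries = 0
        while error is not None and retries < 3:                # corrective sub-loop
            sql = fix_sql(sql, f"Your query failed with error {error}. Please correct it.")
            rows, error = run_sql(sql)
            retries += 1
        if error is None:
            samples[sql] = rows[:100]                           # keep at most 100 sample rows
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= 5:                       # abort the exploration phase
                break
    return samples                # joined into the prompt context for self-refinement
```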

5. Empirical Performance and Benchmark Analysis

ReFoRCE demonstrates new state-of-the-art execution accuracy (EX) on the Spider 2.0 benchmark and its Snowflake and Lite subsets. The following table reports key results:

Method         | Model      | Snowflake (EX) | Lite (EX)
---------------|------------|----------------|----------
ReFoRCE (ours) | o1-preview | 26.69          | 24.50
Spider-Agent   | o1-preview | 20.29          | 20.66
Spider-Agent   | GPT-4o     | 10.79          | 10.87
DAIL-SQL       | GPT-4o     | 2.20           | 5.68
CHESS          | GPT-4o     | 1.28           | 3.84
DIN-SQL        | GPT-4o     | 0.00           | 1.46
SFT            | CodeS-15B  | 0.00           | 0.73

ReFoRCE’s improvement of approximately 6–7 percentage points (pp) over Spider-Agent highlights the impact of comprehensive table compression, format restriction, iterative column exploration, and self-refinement. The modest accuracy drop on the “Lite” task, which includes BigQuery and SQLite in addition to Snowflake, indicates minimal dialect-specific engineering overhead.

6. Components and Design Integration

ReFoRCE’s architecture can be characterized as a composition of the following principal mechanisms:

  • Pattern-based schema compression and LLM-guided schema linking for scalable context management.
  • Strict output format restrictions to maintain compliance with execution environments.
  • A history-augmented self-refinement pipeline, with dialect-specific corrective strategies.
  • Iterative column exploration enabled by execution-based feedback and LLM remediations.
  • Parallelized consensus enforcement through majority voting and confidence-based aggregation.

This multi-layered integration enables ReFoRCE to robustly and efficiently translate natural language queries into multi-dialect SQL, addressing the pathological long-context and semantic mapping challenges posed by large-scale structured data systems (Deng et al., 2 Feb 2025).
