
ReFoRCE Agent: Advanced Text-to-SQL

Updated 24 November 2025
  • ReFoRCE Agent is a Text-to-SQL system that uses schema compression and LLM-guided linking to manage large, heterogeneous enterprise databases.
  • It leverages iterative self-refinement, error feedback, and CTE decomposition to generate accurate multi-dialect SQL queries.
  • Consensus enforcement through parallel candidate voting and iterative column exploration ensures robust performance on benchmarks like Spider 2.0.

ReFoRCE is a Text-to-SQL agent designed to address the challenges of deploying natural language interfaces over large, heterogeneous, enterprise-grade databases, particularly those characterized by schema sizes exceeding 1,000 columns, a diversity of SQL dialects (such as BigQuery and Snowflake), and complex analytical query requirements. Distinct from prior approaches, ReFoRCE incorporates aggressive schema information compression, LLM-guided schema linking, iterative self-refinement, parallel consensus voting, and iterative column exploration. These mechanisms result in state-of-the-art performance on the Spider 2.0 benchmark, including its challenging “Snow” and “Lite” subsets (Deng et al., 2 Feb 2025).

1. Database Information Compression and Schema Linking

ReFoRCE implements a multi-stage pipeline to reduce input context length and enhance LLM relevance for large schemas. The first stage groups tables by common prefixes or suffixes using a string-prefix similarity:

\mathrm{sim}_{\mathrm{prefix}}(t_i, t_j) = |\mathrm{LongestCommonPrefix}(\mathrm{name}(t_i), \mathrm{name}(t_j))|

Tables are merged if their prefix (or suffix) similarity exceeds a threshold $\tau$, with only the lexicographically first table’s DDL (Data Definition Language) retained for each group. For example, the collection $\mathcal{T} = \{t_1,\dots,t_N\}$ is partitioned into groups $\mathcal{C} = \{C_1,\dots,C_M\}$, yielding the compressed representation

\mathrm{compress}_1(\mathcal{T}) = \{ \mathrm{DDL}(r_{C_k}) : k = 1 \dots M \} \cup \{ \mathrm{name}(t) : t \notin \cup_k C_k \}
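
As an illustration, the following is a minimal Python sketch of this grouping step, assuming a hypothetical `tables` dict mapping table names to DDL strings; the character threshold `tau` and the greedy merge over sorted names are illustrative choices rather than details taken from the paper.

```python
# Minimal sketch of prefix-based table grouping (assumption: tables is a dict
# from table name to DDL string; tau and the greedy merge are illustrative).
import os

def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two table names."""
    return len(os.path.commonprefix([a, b]))

def compress_schema(tables: dict[str, str], tau: int = 8):
    """Group tables whose names share a prefix of at least `tau` characters and
    keep the full DDL only for the lexicographically first table in each group."""
    groups: list[list[str]] = []
    for name in sorted(tables):
        if groups and common_prefix_len(groups[-1][0], name) >= tau:
            groups[-1].append(name)            # merge into the current group
        else:
            groups.append([name])              # start a new group
    kept_ddls = {g[0]: tables[g[0]] for g in groups}   # one representative DDL per group
    name_only = [n for g in groups for n in g[1:]]     # remaining tables listed by name only
    return kept_ddls, name_only

# Example: event_2021, event_2022, event_2023 collapse into one DDL plus two bare names.
```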

Empirically, the compressed token length $L_c$ satisfies $L_c/L_0 \ll 1$ (typically 0.1–0.5 for very large databases). LLM-guided schema linking then selects only contextually relevant tables for the query $Q$ via a lightweight LLM session, asking, “Which of these tables are likely to be needed to answer $Q$?” Only the DDLs for the top $K$ tables, as identified by the LLM, are surfaced in detail, and the initial prompt is

P_{\mathrm{init}} = \mathrm{compress}(\mathcal{D}) + \mathcal{E} + Q

where $\mathcal{D}$ is the database schema and $\mathcal{E}$ denotes supplementary context.
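
A sketch of the linking and prompt-assembly step might look as follows; `ask_llm` is a hypothetical single-turn LLM call and the prompt wording is illustrative rather than the exact ReFoRCE prompt.

```python
# Sketch of LLM-guided schema linking and assembly of the initial prompt P_init.
# `ask_llm` is a hypothetical LLM wrapper; the prompt text is illustrative only.
def link_schema(ask_llm, question: str, kept_ddls: dict[str, str],
                name_only: list[str], top_k: int = 10) -> list[str]:
    """Ask a lightweight LLM session which tables are likely needed for the question."""
    table_listing = "\n".join(list(kept_ddls) + name_only)
    reply = ask_llm(
        f"Question: {question}\n"
        f"Tables:\n{table_listing}\n"
        f"Which of these tables are likely to be needed to answer the question? "
        f"Return at most {top_k} table names, one per line."
    )
    return [line.strip() for line in reply.splitlines() if line.strip()][:top_k]

def build_initial_prompt(question: str, kept_ddls: dict[str, str],
                         linked_tables: list[str], extra_context: str = "") -> str:
    """P_init = compressed schema (detailed DDL only for linked tables) + context + question."""
    detailed_ddls = "\n\n".join(kept_ddls[t] for t in linked_tables if t in kept_ddls)
    return f"{detailed_ddls}\n\n{extra_context}\n\nQuestion: {question}"
```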

2. Self-Refinement: Iterative Correction and CTE Decomposition

ReFoRCE integrates an error-driven, dialect-agnostic self-refinement loop for SQL generation. The routine executes as follows:

  1. Initialize $k \leftarrow 0$ and history $H \leftarrow [\,]$.
  2. Generate SQL candidate $S^{(k)}$ via $S^{(k)} = \mathrm{LLM}(P_{\mathrm{init}}, F, H)$, where $F$ is the expected output format.
  3. Execute $S^{(k)}$ in the target engine (e.g., BigQuery) to obtain result $R^{(k)}$.
  4. If $R^{(k)}$ is a valid, non-empty result matching $F$ and the same result already appears in the history (logical equivalence across iterations), output $S^{(k)}$; otherwise, continue.
  5. Errors $\delta^{(k)}$ identified during execution (syntax, type, etc.) are appended to $H$, and $k$ is incremented.
  6. Terminate after $K_{\max}$ iterations or after three consecutive failures.

The update is formalized as

\begin{aligned}
e^{(k)} &= \mathrm{exec}(S^{(k)}, \mathcal{I}), \\
\delta^{(k)} &= \mathrm{parse\_error}(e^{(k)}), \\
S^{(k+1)} &= \mathrm{LLM}(P_{\mathrm{init}}, F, (S^{(0)}, \delta^{(0)}, \dots, S^{(k)}, \delta^{(k)}))
\end{aligned}

A CTE-based fallback decomposes SQL into named Common Table Expressions upon repeated failures, localizing error correction to fragments with mismatched or unexpected intermediate outputs.
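
The loop can be sketched as follows; `generate_sql`, `execute_sql`, and `parse_error` are hypothetical stand-ins for the LLM call, the database client, and error extraction, and the stopping rules mirror the description above, with the CTE-based fallback left as a comment.

```python
# Sketch of the history-augmented self-refinement loop (hypothetical helpers).
def self_refine(generate_sql, execute_sql, parse_error,
                p_init: str, fmt: str, k_max: int = 8):
    history = []                       # list of (sql, feedback) pairs fed back to the LLM
    seen_results = {}                  # result fingerprint -> number of times observed
    consecutive_failures = 0
    for _ in range(k_max):
        sql = generate_sql(p_init, fmt, history)
        result, error = execute_sql(sql)           # error is None on success
        if error is None and result:               # valid, non-empty result
            fingerprint = str(result)
            seen_results[fingerprint] = seen_results.get(fingerprint, 0) + 1
            if seen_results[fingerprint] >= 2:     # same result reached twice: accept
                return sql
            history.append((sql, f"Result preview: {fingerprint[:500]}"))
            consecutive_failures = 0
        else:
            feedback = parse_error(error) if error is not None else "Query returned no rows."
            history.append((sql, feedback))
            consecutive_failures += 1
            if consecutive_failures >= 3:          # give up; a CTE-based fallback would run here
                break
    return None                                    # no accepted candidate within K_max iterations
```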

3. Consensus Enforcement and Candidate Aggregation

ReFoRCE employs a parallel workflow for robustness via consensus. $T$ independent threads (with $T = 3$) execute the complete pipeline (compression, exploration, refinement) with stochastic sampling, producing $\{R^{(1)}, R^{(2)}, \dots, R^{(T)}\}$. The majority candidate $R^*$ is selected by

R^* = \arg\max_q \phi(q)

where $\phi(q)$ counts the supporting threads. Ties trigger a fallback LLM prompt for plausibility ranking. Alternatively, a soft-voting procedure integrates model-assigned scores $s_j$ for each candidate $Q_j$:

\mathrm{score}(q) = \sum_{j: Q_j = q} s_j, \qquad q^* = \arg\max_q \mathrm{score}(q)

This ensemble process increases robustness against spurious LLM generations and aids in disambiguating ambiguous queries.
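
A compact sketch of the aggregation step is given below, using soft voting as the tie-breaker in place of the LLM plausibility prompt; the result fingerprints and scores are assumed to come from the upstream threads.

```python
# Sketch of candidate aggregation across T parallel threads: hard majority voting
# with a soft-voting fallback (summing model-assigned scores) on ties.
from collections import Counter, defaultdict

def consensus(results: list[tuple[str, float]]) -> str:
    """results: one (result_fingerprint, score) pair per thread."""
    counts = Counter(fingerprint for fingerprint, _ in results)
    top_fingerprint, top_count = counts.most_common(1)[0]
    if sum(1 for c in counts.values() if c == top_count) == 1:
        return top_fingerprint                     # unique majority winner
    scores: defaultdict[str, float] = defaultdict(float)
    for fingerprint, score in results:
        scores[fingerprint] += score               # soft voting over tied candidates
    return max(scores, key=scores.get)

# Example: consensus([("A", 0.7), ("B", 0.9), ("A", 0.6)]) returns "A".
```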

4. Iterative Column Exploration and Feedback

Column exploration precedes final SQL synthesis to discover relevant features:

  1. An auxiliary LLM session, primed with $P_{\mathrm{init}}$ and an exploration prompt, generates up to 10 candidate table.column references accompanied by simple selection queries (SELECT DISTINCT, limited to 100 rows, no CTEs).
  2. Each candidate query $S^{\mathrm{exp}}_i$ is executed; errors invoke a corrective sub-loop (at most three retries per failed SQL).
  3. On five consecutive failures, the exploration phase aborts.
  4. Successfully retrieved sample values $R^{\mathrm{exp}}$ are joined into the main prompt context for the self-refinement phase.

The correction algorithm retries failed queries by providing failure feedback (“Your query failed with error $E$. Please correct it.”) to the LLM, replacing the original statement until execution succeeds or the retry limit is reached.
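
The exploration-and-correction loop can be sketched as follows; `propose_exploration_sqls`, `fix_sql`, and `run_sql` are hypothetical helpers, and the numeric limits follow the description above.

```python
# Sketch of iterative column exploration with execution feedback (hypothetical helpers).
def explore_columns(propose_exploration_sqls, fix_sql, run_sql, p_init: str) -> dict:
    samples = {}                                   # exploration SQL -> retrieved sample rows
    consecutive_failures = 0
    for sql in propose_exploration_sqls(p_init)[:10]:          # up to 10 candidate queries
        rows, error = run_sql(sql)                              # error is None on success
        retries = 0
        while error is not None and retries < 3:                # corrective sub-loop
            sql = fix_sql(sql, f"Your query failed with error {error}. Please correct it.")
            rows, error = run_sql(sql)
            retries += 1
        if error is None:
            samples[sql] = rows[:100]                           # keep at most 100 sample rows
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= 5:                       # abort the exploration phase
                break
    return samples                # joined into the prompt context for self-refinement
```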

5. Empirical Performance and Benchmark Analysis

ReFoRCE demonstrates new state-of-the-art execution accuracy (EX) on the Spider 2.0 benchmark and its Snowflake and Lite subsets. The following table reports key results:

Method         | Model      | Snowflake (EX) | Lite (EX)
---------------|------------|----------------|----------
ReFoRCE (ours) | o1-preview | 26.69          | 24.50
Spider-Agent   | o1-preview | 20.29          | 20.66
Spider-Agent   | GPT-4o     | 10.79          | 10.87
DAIL-SQL       | GPT-4o     | 2.20           | 5.68
CHESS          | GPT-4o     | 1.28           | 3.84
DIN-SQL        | GPT-4o     | 0.00           | 1.46
SFT            | CodeS-15B  | 0.00           | 0.73

ReFoRCE’s improvement of approximately 6–7 percentage points (pp) over Spider-Agent highlights the impact of comprehensive table compression, format restriction, iterative column exploration, and self-refinement. The modest accuracy drop on the “Lite” task, which includes BigQuery and SQLite in addition to Snowflake, indicates minimal dialect-specific engineering overhead.

6. Components and Design Integration

ReFoRCE’s architecture can be characterized as a composition of the following principal mechanisms:

  • Pattern-based schema compression and LLM-guided schema linking for scalable context management.
  • Strict output format restrictions to maintain compliance with execution environments.
  • A history-augmented self-refinement pipeline, with dialect-specific corrective strategies.
  • Iterative column exploration enabled by execution-based feedback and LLM remediations.
  • Parallelized consensus enforcement through majority voting and confidence-based aggregation.

This multi-layered integration enables ReFoRCE to robustly and efficiently translate natural language queries into multi-dialect SQL, addressing the pathological long-context and semantic mapping challenges posed by large-scale structured data systems (Deng et al., 2 Feb 2025).
