ReFoRCE Agent: Advanced Text-to-SQL
- ReFoRCE Agent is a Text-to-SQL system that uses schema compression and LLM-guided linking to manage large, heterogeneous enterprise databases.
- It leverages iterative self-refinement, error feedback, and CTE decomposition to generate accurate multi-dialect SQL queries.
- Consensus enforcement through parallel candidate voting and iterative column exploration ensures robust performance on benchmarks like Spider 2.0.
ReFoRCE is a Text-to-SQL agent designed to address the challenges of deploying natural language interfaces over large, heterogeneous, enterprise-grade databases, particularly those characterized by schema sizes exceeding 1,000 columns, a diversity of SQL dialects (such as BigQuery and Snowflake), and complex analytical query requirements. Distinct from prior approaches, ReFoRCE incorporates aggressive schema information compression, LLM-guided schema linking, iterative self-refinement, parallel consensus voting, and iterative column exploration. These mechanisms result in state-of-the-art performance on the Spider 2.0 benchmark, including its challenging “Snow” and “Lite” subsets (Deng et al., 2 Feb 2025).
1. Database Information Compression and Schema Linking
ReFoRCE implements a multi-stage pipeline to reduce input context length and enhance LLM relevance for large schemas. The first stage groups tables by common prefixes or suffixes using a string-prefix similarity:
Tables are merged if their prefix (or suffix) similarity exceeds a fixed threshold, with only the lexicographically first table’s DDL (Data Definition Language) retained for each group. For example, a collection of similarly named tables is partitioned into groups, yielding a compressed representation in which each group is summarized by a single representative DDL.
Empirically, the compressed schema is typically 0.1–0.5 times the original token length for very large databases. LLM-guided schema linking then selects only contextually relevant tables for the query via a lightweight LLM session, asking which of these tables are likely to be needed to answer the user’s question. Only the DDLs for the top tables identified by the LLM are surfaced in detail, and the initial prompt combines the compressed schema, the selected DDLs, the user question, and supplementary context.
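The prefix-based grouping can be sketched in a few lines of Python. This is a minimal illustration, not the paper’s implementation: a fixed-length name prefix stands in for the similarity threshold, and the function and table names are hypothetical.

```python
from collections import defaultdict

def compress_schema(tables, prefix_len=3):
    """Group tables sharing a common name prefix; keep one DDL per group.

    `tables` maps table name -> DDL string. `prefix_len` is a hypothetical
    grouping key standing in for the paper's similarity threshold.
    """
    groups = defaultdict(list)
    for name in sorted(tables):          # sorted -> first member is lexicographic min
        groups[name[:prefix_len]].append(name)
    compressed = {}
    for members in groups.values():
        representative = members[0]      # keep only the first table's DDL
        compressed[representative] = tables[representative]
    return compressed

tables = {
    "GA_SESSIONS_20170801": "CREATE TABLE GA_SESSIONS_20170801 (...)",
    "GA_SESSIONS_20170802": "CREATE TABLE GA_SESSIONS_20170802 (...)",
    "USERS": "CREATE TABLE USERS (...)",
}
print(compress_schema(tables))
```

With the two session tables collapsed into one group, the compressed schema carries two DDLs instead of three; on real warehouses with thousands of date-partitioned tables, this is where the bulk of the token savings comes from.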
2. Self-Refinement: Iterative Correction and CTE Decomposition
ReFoRCE integrates an error-driven, dialect-agnostic self-refinement loop for SQL generation. The routine executes as follows:
- Initialize an empty correction history and an iteration counter.
- Generate a SQL candidate via the LLM, conditioned on the prompt, the accumulated history, and the expected output format.
- Execute the candidate in the target engine (e.g., BigQuery).
- If execution yields a valid, non-empty result matching the expected format, and the same result has already appeared in the history (indicating logical equivalence across candidates), output the candidate; otherwise, continue.
- Errors identified during execution (syntax, type, etc.) are appended to the history, and the counter is incremented.
- Terminate after a maximum number of iterations or after three consecutive failures.
A CTE-based fallback decomposes SQL into named Common Table Expressions upon repeated failures, localizing error correction to fragments with mismatched or unexpected intermediate outputs.
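Putting the loop together, a minimal sketch follows. The `llm` and `execute` interfaces are hypothetical stand-ins for the model call and the database engine, and the acceptance rule is the “same result seen twice” check described above.

```python
def self_refine(llm, execute, prompt, max_iters=8):
    """Iterative SQL generation with execution feedback (sketch).

    `llm(prompt, history)` returns a SQL string; `execute(sql)` returns a
    (result, error) pair. A candidate is accepted once the same non-empty
    result has been produced twice, taken as evidence of logical equivalence.
    """
    history, seen_results, consecutive_failures = [], [], 0
    for _ in range(max_iters):
        sql = llm(prompt, history)
        result, error = execute(sql)
        if error is None and result:
            consecutive_failures = 0
            if result in seen_results:        # same answer twice -> accept
                return sql
            seen_results.append(result)
            history.append(("result", sql, result))
        else:
            consecutive_failures += 1
            if consecutive_failures >= 3:     # abort on repeated failures
                break
            history.append(("error", sql, error))
    return None                               # caller falls back to CTE decomposition
```

On repeated failure the sketch returns `None`, which is where the CTE-based fallback would take over, regenerating the query as named fragments so errors can be localized.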
3. Consensus Enforcement and Candidate Aggregation
ReFoRCE employs a parallel workflow for robustness via consensus. Multiple independent threads execute the complete pipeline (compression, exploration, refinement) with stochastic sampling, each producing a candidate query. The majority candidate is selected by counting, for each distinct execution result, the number of supporting threads and choosing the most-supported result. Ties trigger a fallback LLM prompt for plausibility ranking. Alternatively, a soft voting procedure aggregates model-assigned confidence scores across the candidates.
This ensemble process increases robustness against spurious LLM generations and aids in disambiguating ambiguous queries.
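A hard-voting version of this aggregation can be sketched as follows, assuming each thread’s candidate SQL is paired with a hashable execution result (e.g., a tuple of rows); `rank_fallback` is a hypothetical LLM-based tie-breaker.

```python
from collections import Counter

def consensus(candidates, results, rank_fallback=None):
    """Majority vote over execution results of parallel candidates (sketch).

    `candidates[i]` is a SQL string whose execution result is `results[i]`.
    Queries that differ textually but produce the same result vote together.
    """
    votes = Counter(results)
    top = votes.most_common()
    best_count = top[0][1]
    tied = [r for r, c in top if c == best_count]
    if len(tied) > 1 and rank_fallback is not None:
        # tie: defer to an LLM plausibility ranking over the tied candidates
        return rank_fallback([c for c, r in zip(candidates, results) if r in tied])
    winner = tied[0]
    return next(c for c, r in zip(candidates, results) if r == winner)
```

Voting on execution results rather than SQL text is the key design choice: two syntactically different but logically equivalent queries reinforce each other instead of splitting the vote.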
4. Iterative Column Exploration and Feedback
Column exploration precedes final SQL synthesis to discover relevant features:
- An auxiliary LLM session, primed with the compressed schema and an exploration prompt, generates up to 10 candidate table.column references accompanied by simple selection queries (plain `SELECT` statements limited to 100 rows, with no CTEs).
- Each candidate query is executed; errors invoke a corrective sub-loop (max three retries per failed SQL).
- On five consecutive failures, the exploration phase aborts.
- Successfully retrieved sample values are joined into the main prompt context for the self-refinement phase.
The correction algorithm retries failed queries by returning the failure feedback to the LLM (“Your query failed with error: <message>. Please correct it.”) and replaces the initial statements until execution succeeds or the retry limit is reached.
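The exploration-and-retry routine can be sketched as below; as before, `llm` and `execute` are hypothetical interfaces, and the retry and abort limits follow the counts stated above.

```python
def explore_columns(llm, execute, candidate_queries,
                    max_retries=3, max_consecutive_failures=5):
    """Run exploratory probe queries, retrying failures with feedback (sketch).

    `execute(sql)` returns a (rows, error) pair; `llm(feedback)` returns a
    corrected SQL string. Sample rows from successful probes are collected
    for inclusion in the main prompt.
    """
    samples, consecutive_failures = [], 0
    for sql in candidate_queries:
        rows = None
        for _ in range(max_retries):
            rows, error = execute(sql)
            if error is None:
                break
            # feed the engine's error back and ask for a corrected statement
            sql = llm(f"Your query failed with error: {error}. Please correct it.")
        if rows:
            samples.append((sql, rows[:100]))  # cap sample values per probe
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= max_consecutive_failures:
                break                          # abort the exploration phase
    return samples
```

The collected `(query, sample rows)` pairs are exactly what gets joined into the prompt context for the subsequent self-refinement phase.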
5. Empirical Performance and Benchmark Analysis
ReFoRCE demonstrates new state-of-the-art execution accuracy (EX) on the Spider 2.0 benchmark and its Snowflake and Lite subsets. The following table reports key results:
| Method | Model | Snowflake (EX) | Lite (EX) |
|---|---|---|---|
| ReFoRCE (ours) | o1-preview | 26.69 | 24.50 |
| Spider-Agent | o1-preview | 20.29 | 20.66 |
| Spider-Agent | GPT-4o | 10.79 | 10.87 |
| DAIL-SQL | GPT-4o | 2.20 | 5.68 |
| CHESS | GPT-4o | 1.28 | 3.84 |
| DIN-SQL | GPT-4o | 0.00 | 1.46 |
| SFT CodeS-15B | — | 0.00 | 0.73 |
ReFoRCE’s improvements of roughly 4–6 percentage points (pp) over Spider-Agent (6.4 pp on Snowflake, 3.8 pp on Lite) highlight the impact of comprehensive table compression, format restriction, iterative column exploration, and self-refinement. The modest accuracy drop on the “Lite” subset, which adds BigQuery and SQLite to Snowflake, indicates minimal dialect-specific engineering overhead.
6. Components and Design Integration
ReFoRCE’s architecture can be characterized as a composition of the following principal mechanisms:
- Pattern-based schema compression and LLM-guided schema linking for scalable context management.
- Strict output format restrictions to maintain compliance with execution environments.
- A history-augmented self-refinement pipeline, with dialect-specific corrective strategies.
- Iterative column exploration enabled by execution-based feedback and LLM remediations.
- Parallelized consensus enforcement through majority voting and confidence-based aggregation.
This multi-layered integration enables ReFoRCE to robustly and efficiently translate natural language queries into multi-dialect SQL, addressing the pathological long-context and semantic mapping challenges posed by large-scale structured data systems (Deng et al., 2 Feb 2025).