SQL-PaLM Framework
- The paper introduces SQL-PaLM, a framework for Text-to-SQL translation leveraging prompt engineering, instruction fine-tuning, and execution-guided ranking to boost performance.
- The methodology employs schema-aware encoding, synthetic data augmentation, and retrieval-based column selection to manage large and complex database schemas.
- Empirical results on Spider and BIRD benchmarks show significant execution accuracy improvements, underlining SQL-PaLM’s effectiveness in real-world scenarios.
The SQL-PaLM framework is an LLM adaptation strategy for Text-to-SQL tasks, designed to enhance natural-language-to-SQL translation via prompt engineering, instruction fine-tuning, execution-guided ranking, and schema-aware navigation. SQL-PaLM orchestrates these techniques to achieve improved performance, robustness to large and complex database schemas, and adaptability across benchmarks such as Spider and BIRD (Sun et al., 2023). Key innovations include consistency decoding with execution-based error filtering, task-specific schema and content serialization, synthetic data augmentation, query-specific content injection, retrieval-based schema selection, and test-time output selection with execution feedback.
1. Input Representation and Prompt Design
Central to SQL-PaLM is its input representation strategy, tailored for LLMs in Text-to-SQL generation. The prompt is structured as:

$$\text{Prompt} = \langle\text{Schema}\rangle \;\|\; \langle\text{Content}\rangle \;\|\; \langle\text{Question}\rangle \;\|\; \langle\text{SQL}\rangle,$$

where $\|$ denotes concatenation. Here:
- $\langle\text{Schema}\rangle$ serializes the database schema: pipe-delimited table and column names, explicit types, primary and foreign keys (e.g., “| Table_1 : Col_1 (d_1), Col_2 (d_2), ... | Table_2: ... | K_p ; K_f ;”).
- $\langle\text{Content}\rangle$ incorporates auxiliary components: column descriptions, query-specific database content (e.g., cell entries with high similarity to question tokens), and hints.
- $\langle\text{Question}\rangle$ is the natural language question.
- $\langle\text{SQL}\rangle$ signals SQL generation.
This encoding accommodates both “concise” and “verbose” schema descriptions and permits incorporation of extra schema content when necessary. Concatenating $\langle\text{Schema}\rangle$ and $\langle\text{Content}\rangle$ exploits the LLM’s generalization capacity, allowing direct mapping from natural language to SQL, supporting both precise column/table mapping and content-aware disambiguation.
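As a concrete illustration, the following Python sketch builds the schema segment of the prompt from a SQLite database; the function name and the exact delimiter layout are illustrative assumptions rather than the paper's released code.

```python
import sqlite3

def serialize_schema(db_path: str) -> str:
    """Serialize a SQLite schema into a pipe-delimited prompt segment:
    tables, columns with types, then primary/foreign keys (layout assumed)."""
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    tables = [row[0] for row in cur.fetchall()]

    parts, pks, fks = [], [], []
    for table in tables:
        cur.execute(f"PRAGMA table_info('{table}')")
        cols = cur.fetchall()  # rows: (cid, name, type, notnull, default, pk)
        col_str = ", ".join(f"{c[1]} ({c[2].lower()})" for c in cols)
        parts.append(f"| {table} : {col_str} ")
        pks += [f"{table}.{c[1]}" for c in cols if c[5]]
        cur.execute(f"PRAGMA foreign_key_list('{table}')")
        fks += [f"{table}.{fk[3]} = {fk[2]}.{fk[4]}" for fk in cur.fetchall()]

    conn.close()
    return "".join(parts) + f"| primary keys: {', '.join(pks)} ; foreign keys: {', '.join(fks)} ;"

# Example usage (hypothetical paths/variables):
# prompt = serialize_schema("concert_singer.sqlite") + content + question + "\nSQL:"
```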
2. Learning Regimes: Few-shot Prompting and Instruction Fine-tuning
The framework supports two predominant learning regimes:
Few-shot Prompting: Prepend a limited set of demonstration pairs to the query, and auto-regressively decode SQL candidates. Quality is further improved via “consistency decoding with execution-based error filtering”: sample candidate queries $s_1, \dots, s_K$ from the LLM at high temperature, run each candidate through an external SQL executor $E$, aggregate candidates whose successful execution results match, and select:

$$\hat{s} = \arg\max_{s_k:\, E(s_k) \in \mathcal{V}} \bigl|\{\, j : E(s_j) = E(s_k) \,\}\bigr|,$$

where $\mathcal{V}$ is the set of error-free execution results.
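A minimal sketch of this selection step, assuming the candidate SQL strings have already been sampled at high temperature and are executed here against a SQLite database (the sampling call itself is elided):

```python
import sqlite3
from collections import Counter

def execute(db_path: str, sql: str):
    """Run a candidate SQL query; return a hashable result set, or None on error."""
    try:
        conn = sqlite3.connect(db_path)
        rows = frozenset(map(tuple, conn.execute(sql).fetchall()))
        conn.close()
        return rows
    except sqlite3.Error:
        return None  # dropped by execution-based error filtering

def consistency_select(db_path: str, candidates: list[str]) -> str | None:
    """Group error-free candidates by execution result and return one
    candidate from the largest (most consistent) group."""
    results = {sql: execute(db_path, sql) for sql in candidates}
    valid = {sql: r for sql, r in results.items() if r is not None}
    if not valid:
        return None
    majority_result, _ = Counter(valid.values()).most_common(1)[0]
    return next(sql for sql, r in valid.items() if r == majority_result)
```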
Instruction Fine-tuning: LLMs are fine-tuned to maximize the likelihood of the target SQL given the serialized prompt,

$$\max_{\theta} \; \sum_{(x,\, s) \in \mathcal{D}} \log p_{\theta}(s \mid x),$$

on a mixture of datasets ($\mathcal{D}$; e.g., Spider, BIRD) with coverage/dialect diversity. This regime also incorporates synthetic SQL augmentation: for each training pair, prompt the model to generate alternative queries; validate each alternative via exact execution and a similarity threshold, and add only executable, nontrivial candidates to the training set.
Fine-tuning leverages expanded schema representations, robustly anchors schema items, and supports model adaptation across domains and SQL dialects.
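The synthetic-augmentation filter might look as follows; `generate_alternatives` stands in for an LLM-backed generator, `execute` is the helper from the previous sketch, and the token-overlap check is a simple stand-in for the paper's similarity threshold:

```python
def augment_example(db_path, question, gold_sql, generate_alternatives, execute):
    """Keep only alternatives that execute to the gold result and differ
    enough from the gold query to be nontrivial (threshold assumed)."""
    gold_result = execute(db_path, gold_sql)
    kept = []
    for alt in generate_alternatives(question, gold_sql):
        if gold_result is None or execute(db_path, alt) != gold_result:
            continue  # must be executable and match the gold execution result
        gold_tokens = set(gold_sql.lower().split())
        overlap = len(set(alt.lower().split()) & gold_tokens)
        if overlap / max(len(gold_tokens), 1) < 0.9:
            kept.append(alt)  # nontrivially different surface form
    return kept
```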
3. Execution-Guided Test-Time Selection
To harness the LLM’s adaptive SQL synthesis, SQL-PaLM performs output aggregation and output reranking:
- Multiple SQL outputs are generated (from distinct paradigms, prompt seeds, models, etc.).
- Each output is executed.
- Outputs producing the same valid execution result are grouped; the largest (majority) execution-result group is preferred.
Formally, for candidates $s_1, \dots, s_K$, select:

$$\hat{s} = \arg\max_{k \in \{1, \dots, K\}} w(s_k),$$

where $w(\cdot)$ combines output probability and group execution consistency.
Ablation studies reveal that omitting either execution-based decoding or output error filtering leads to execution-accuracy drops of up to 5 percentage points on Spider/BIRD.
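One plausible instantiation of the combined score $w(\cdot)$, reusing the `execute` helper above; the per-candidate log-probabilities and the weighting factor `lam` are assumptions, not values specified by the paper:

```python
import math
from collections import Counter

def rerank(db_path, candidates, logprobs, execute, lam=0.5):
    """Score each candidate by model log-probability plus the (log) size of
    its execution-consistency group; return the highest-scoring candidate."""
    results = {sql: execute(db_path, sql) for sql in candidates}
    valid = [sql for sql in candidates if results[sql] is not None]
    if not valid:
        return None
    group_size = Counter(results[sql] for sql in valid)

    def score(sql):
        return logprobs[sql] + lam * math.log(group_size[results[sql]])

    return max(valid, key=score)
```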
4. Schema Navigation and Scalability
SQL-PaLM addresses large, multi-table databases with two practical schema navigation strategies:
Retrieval-based Column Selection: Encode the user question $q$ and each schema column $c_i$ into embeddings ($e_q$, $e_{c_i}$), and rank columns by cosine similarity:

$$\mathrm{sim}(q, c_i) = \frac{e_q \cdot e_{c_i}}{\lVert e_q \rVert \, \lVert e_{c_i} \rVert}.$$

Include only the top-$k$ ranked columns in the prompt, thereby mitigating “needle-in-a-haystack” issues.
Program-aided Column Selection: Generate a preliminary SQL (using an auxiliary or lightweight model) over the full schema, parse referenced columns, and use only these relevant columns as input for the main LLM.
Both methods support “hard” (truncate prompt) and “soft” (emphasize with extra context) selection; ablations show each yields measurable gains, especially in the presence of large schemas or limited prompt token budgets.
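A minimal sketch of retrieval-based column selection in its “hard” mode, assuming an `embed` function that maps strings to vectors (e.g., any sentence-embedding model):

```python
import numpy as np

def select_columns(question, columns, embed, k=20):
    """Rank schema columns by cosine similarity to the question and keep the
    top-k ('hard' selection). `columns` are strings like 'singer.name (text)'."""
    q = embed(question)
    scored = []
    for col in columns:
        c = embed(col)
        cos = float(np.dot(q, c) / (np.linalg.norm(q) * np.linalg.norm(c)))
        scored.append((cos, col))
    scored.sort(reverse=True)
    return [col for _, col in scored[:k]]
```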
5. Augmentation Techniques and Query-Specific Content Integration
To maximize data efficiency and improve LLM generalization, SQL-PaLM includes:
- Synthetic Data Augmentation: Prompt the LLM (or auxiliary generator) to synthesize alternative SQLs for a given question or gold SQL. Candidates passing validation (execution correctness, structural difference thresholds) are retained for additional training.
- Query-Specific Database Content: For each token in the question, automatically match against database entries using substring matching or embedding similarity, then integrate the top-matching cell values or records into the content segment of the prompt (see the sketch after this list). This resolves schema-to-question synonym gaps and bolsters model performance, especially when the user’s phrasing diverges from the schema vocabulary.
Ablation results confirm that both methods incrementally improve execution accuracy; query-specific content in particular yields a measurable gain on BIRD (Sun et al., 2023).
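A sketch of query-specific content matching via substring lookup against a SQLite database; the tokenizer, the short-token heuristic, and the per-column limit are illustrative assumptions:

```python
import sqlite3

def match_db_content(db_path, question, columns, limit=3):
    """For each question token, find cell values containing that token
    (substring match) in the given text columns; return matches for the
    prompt's content segment. `columns` are (table, column) pairs."""
    conn = sqlite3.connect(db_path)
    matches = []
    for token in question.lower().split():
        if len(token) < 4:
            continue  # skip short/function words (heuristic assumption)
        for table, column in columns:
            rows = conn.execute(
                f'SELECT DISTINCT "{column}" FROM "{table}" '
                f'WHERE LOWER("{column}") LIKE ? LIMIT ?',
                (f"%{token}%", limit),
            ).fetchall()
            matches += [f"{table}.{column} = {r[0]!r}" for r in rows]
    conn.close()
    return matches
```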
6. Benchmark Results and Comparative Analysis
SQL-PaLM has been empirically validated on Spider and BIRD:
- Spider (few-shot prompting): strong execution accuracy and test-suite accuracy on the development set.
- BIRD (fine-tuning): execution accuracy that improves further with mixed-dataset training, synthetic data augmentation, and column selection.
Error analysis indicates residual issues often arise from ambiguous questions or evaluation stringency (e.g., formatting/type mismatches).
When compared with retrieval-augmented, multi-agent, or programmatic frameworks (e.g., MAC-SQL (Wang et al., 2023), MAG-SQL (Xie et al., 2024), XiYan-SQL (Liu et al., 2025)), SQL-PaLM’s emphasis on schema-aware encoding, execution validation, and flexible schema-content integration is a key differentiator for large, heterogeneous, or ambiguous relational workloads.
7. Limitations and Prospects
The principal limitations center on:
- Scaling prompt content to extremely large schemas (hundreds or thousands of columns).
- LLM hallucination when the mapping between the question and the ground-truth schema items is underspecified.
- Penalization of semantically equivalent but syntactically different outputs under current benchmark metrics.
Suggested improvements include further enhancement of table and column selection, more sophisticated data and schema augmentation, richer integration of multiple input components (instructions, schema, content, and hinting), parameter-efficient adaptation for domain transfer, and development of more semantically tolerant evaluation metrics.
The SQL-PaLM architecture, integrating prompt design, robust fine-tuning, execution-guided selection, and schema navigation, sets a foundation for next-generation Text-to-SQL frameworks that approach real-world, large-scale database accessibility without loss of accuracy.