Zero-shot & Few-shot NL2SQL

Updated 19 April 2026

Zero-shot and few-shot NL2SQL are approaches that enable models to translate natural language into SQL without extensive task-specific supervision, enhancing adaptability to unseen schemas.
These methods combine techniques such as slot-filling, meta-learning, prompt engineering, and execution-guided search to address challenges like schema generalization and complex query composition.
Empirical benchmarks show that hybrid PLM–LLM pipelines and self-augmentation strategies can significantly boost execution accuracy on diverse, unseen database environments.

Zero-shot and few-shot NL2SQL refer to methodologies that enable models to generate SQL queries from natural language without, or with minimal, task-specific supervision, especially on unseen database schemas, domains, or query structures. These approaches leverage large pre-trained LLMs (PLMs), LLMs, meta-learning, self-augmentation, execution-guided search, and prompt engineering. They are critical for adapting NL2SQL systems to real-world scenarios where exhaustive annotation is unattainable or where generalization to new environments is required.

1. Zero-Shot and Few-Shot NL2SQL: Definitions and Core Challenges

Zero-shot NL2SQL comprises mapping a natural language input $Q$ and database schema $S$ to a SQL query $Y$ , where the model has not been exposed to any annotated (Q, S, SQL) triplets for the target schema or domain. Few-shot NL2SQL permits a small number $k \ll |\text{train}|$ of annotated examples per test schema, leveraging either in-context learning or meta-learning.

Core challenges include:

Schema Generalization: Adapting to new tables, columns, and relational structures absent in training data.
Compositionality: Composing complex SQL patterns (e.g., joins, nested queries) absent from training supervision.
Semantic Parsing Robustness: Disambiguating question intents and aligning them precisely with schema elements, particularly when header names and cell values are ambiguous or unseen.
Minimal Data Regime: Achieving high test-time performance with little to no labeled data from the target domain—posing a substantial drop in accuracy for standard supervised models.

On standard benchmarks, the gap between zero/few-shot and fully supervised models remains non-trivial, especially for schemas with little overlap with training data (Pal et al., 2020, Chen et al., 2021, Gu et al., 2023, Pandey et al., 5 Nov 2025, Lee et al., 17 Feb 2025, Li et al., 24 Feb 2025).

2. Fundamental Methodological Paradigms

2.1 Slot-Filling and Sketch-Based Models

Early data-agnostic approaches (e.g., RoBERTa+Bi-LSTM slot models) decompose NL2SQL as a sequence of slot prediction tasks:

SELECT $\langle$ aggregate $\rangle$ $\langle$ column $\rangle$ FROM $\langle$ table $\rangle$ [WHERE $S$ 0col $S$ 1 $S$ 2op $S$ 3 $S$ 4val $S$ 5]*
Each slot is predicted with a separate submodel, and input features comprise question and header embeddings plus binary knowledge vectors encoding string matches (Pal et al., 2020).

2.2 Content-Aware and Meta-Learning Architectures

Advances include leveraging table content via similarity-driven cell selection for value grounding, and introducing meta-learning (e.g., MAML) so that parameter updates on support sets induce rapid generalization to novel tables. Auxiliary content features are used for WHERE-column and WHERE-value selection, and episodic training splits support and query tables to simulate zero-shot transfer (Chen et al., 2021).

2.3 LLMs and Prompt-Based Inference

LLMs perform text-to-SQL via prompt engineering, without or with very few in-context exemplars. Prompt templates range from:

Zero-shot: Provide only schema and instruction header; model must synthesize full SQL pattern de novo (Pandey et al., 5 Nov 2025).
Few-shot retrieval: Add $S$ 6 NL-SQL paired exemplars based on random or similarity-based retrieval or self-augmentation (Pandey et al., 5 Nov 2025, Lee et al., 17 Feb 2025, Hu et al., 2022).
In-context learning for dialogue state tracking: The IC-DST framework reformulates DST state changes as SQL SELECT statements and builds prompts with schema blocks, in-context exemplars, and a test request; exemplars are selected via SBERT-based retrieval (Hu et al., 2022).

2.4 Self-Augmentation and Example Filtering

SAFE-SQL introduces a self-augmentation framework: the LLM is prompted to generate $S$ 7 synthetic NL-SQL pairs and corresponding reasoning paths, which are then filtered by semantic similarity $S$ 8, structural alignment $S$ 9, and a reasoning path quality rating $Y$ 0. Only examples with composite score above $Y$ 1 are retained for in-context learning, removing dependency on external retrieval sets (Lee et al., 17 Feb 2025).

2.5 Execution-Guided Search and Monte Carlo Tree Search (MCTS)

Alpha-SQL frames zero-shot SQL generation as an MCTS problem, with LLMs dynamically proposing SQL construction actions (e.g., schema selection, column value identification, revision). Candidate SQLs are scored by self-supervised execution consistency rewards without requiring gold annotation, mitigating the combinatorial search space present in zero supervision (Li et al., 24 Feb 2025).

2.6 Hybrid PLM–LLM Pipelines

ZeroNL2SQL interleaves a schema-aware, fine-tuned PLM to generate valid “sketches” (SELECT/FROM/KEYWORDS) with an LLM that completes the sketch (WHERE, JOIN, ORDER BY) using black-box prompting and database-aware predicate calibration. Execution-based selection ensures functional correctness (Gu et al., 2023).

3. Prompt Construction and Example Selection Strategies

Prompt construction is central to both zero- and few-shot NL2SQL performance:

Prompt Type	Input Components	Example Selection
Zero-shot	Instruction, schema	None or formatting example
Few-shot (retrieval)	Instruction, schema, $Y$ 2 exemplars	Random, similarity, or retriever
Self-augmented	Instruction, schema, LLM-generated examples + filter	Self-generation + thresholding
Structured Few-shot	Canonicalized instruction/schema/response triples	Hand-crafted or auto-selected

In few-shot regimes, schema-aware and structured prompts provide substantial gains over naive concatenation. Ordering by relevance and explicit reasoning paths (as chain-of-thought demonstrations) further boost performance for compositional or extra-hard queries (Pandey et al., 5 Nov 2025, Lee et al., 17 Feb 2025, Hu et al., 2022).

4. Performance Analysis and Benchmark Results

Recent benchmarks on Spider, BIRD, and MultiWOZ demonstrate:

Data-agnostic RoBERTa-based models reach execution accuracy (Acc_ex) of 74.7% in true zero-shot regimes (unseen schemas), showing only a 1–2 point drop from seen schemas (Pal et al., 2020).
MC-SQL, leveraging table content and meta-learning, achieves 89.7% EX on WikiSQL zero-shot dev, +7.7% logical-form gain over baseline SQLova (Chen et al., 2021).
SAFE-SQL (self-augmentation) obtains execution accuracy of 87.9% (GPT-4o) on Spider dev, surpassing best few-shot baselines by over 1 pp and naive zero-shot by ≈15 pp (Lee et al., 17 Feb 2025).
GEMMA-SQL yields 65% EM/62% TS in zero-shot, with structured and schema-aware few-shot prompting pushing up to 88% EM/86% TS (Pandey et al., 5 Nov 2025).
Alpha-SQL reaches 69.7% EX on BIRD dev (Qwen2.5-32B), 2.5 pp above GPT-4o-based zero-shot, and matches best fine-tuned baselines on Spider dev with 87.0% (Li et al., 24 Feb 2025).
ZeroNL2SQL achieves a 44.9% execution accuracy on KaggleDBQA zero-shot (vs. 31.9% for PLM-only, 24.3% for vanilla ChatGPT), and 74.9% on Dr.Spider, indicating 10–20 pp gains from combining sketch PLMs, LLMs, and predicate calibration (Gu et al., 2023).

Few-shot gains are typically +10–20 pp for EM/TS when moving from zero-shot to schema/rank-aware prompting or in-context learning.

5. Component Ablations and Limitations

Ablation studies highlight critical design choices:

Removing schema linking or reasoning-path demonstrations in SAFE-SQL yields −7.5 and −3.5 pp EX, respectively; not using self-augmentation drops EX by −10.8 pp (Lee et al., 17 Feb 2025).
In Alpha-SQL, omitting key MCTS actions (schema selection, SQL revision) results in 0.6–1.8 pp accuracy deficit (Li et al., 24 Feb 2025).
Content-aware predictors yield >6 pp LF improvement on zero-shot WikiSQL; omitting meta-learning decreases performance by 1–2 pp (Chen et al., 2021).
Data-agnostic, privacy-preserving approaches depend acutely on descriptive header names and fail on ambiguous mappings or multi-table SQL (Pal et al., 2020).

Recognized limitations involve:

Restricted SQL expressiveness (single-table, limited clause support) in earlier models (Pal et al., 2020, Chen et al., 2021).
Dependence on LLM base quality for prompt-based/self-augmentation methods; weak LLMs generate poor or noisy synthetic examples (Lee et al., 17 Feb 2025).
Manual or heuristic thresholds for OOD and relevance filtering, which require tuning and may not generalize across domains (Lee et al., 17 Feb 2025, Pandey et al., 5 Nov 2025).
Computational cost for meta-learning and MCTS-based approaches; scaling to multi-turn or conversational NL2SQL is a continuing challenge (Gu et al., 2023, Li et al., 24 Feb 2025).

6. Extensions, Applications, and Future Directions

Current research highlights several promising directions:

Extending zero/few-shot NL2SQL to multi-turn, conversational settings, as in text-to-SQL-driven dialogue state tracking (Hu et al., 2022, Gu et al., 2023).
Hybridization of search-based (MCTS) and few-shot strategies, e.g., using few-shot completions as seeds for search or reward model fine-tuning (Li et al., 24 Feb 2025).
Automating relevance threshold calibration using statistical or adversarial methods (Lee et al., 17 Feb 2025).
Incorporating execution-guided or SMT-based symbolic constraints to prune semantically invalid candidate queries before prediction or as verifier models (Li et al., 24 Feb 2025, Gu et al., 2023).
Leveraging self-augmented or automatically synthesized demonstrations for downstream fine-tuning or as support for different SQL dialects and domain adaptation (Lee et al., 17 Feb 2025, Pandey et al., 5 Nov 2025).

A plausible implication is that, as LLM capabilities increase and example synthesis/filtering becomes more reliable and automated, the distinction between few-shot and zero-shot NL2SQL will further blur—making in-context, prompt-driven generation with execution feedback the default paradigm, especially for emerging database environments and unseen cross-domain table schemas.