Papers
Topics
Authors
Recent
Search
2000 character limit reached

Zero-shot & Few-shot NL2SQL

Updated 19 April 2026
  • Zero-shot and few-shot NL2SQL are approaches that enable models to translate natural language into SQL without extensive task-specific supervision, enhancing adaptability to unseen schemas.
  • These methods combine techniques such as slot-filling, meta-learning, prompt engineering, and execution-guided search to address challenges like schema generalization and complex query composition.
  • Empirical benchmarks show that hybrid PLM–LLM pipelines and self-augmentation strategies can significantly boost execution accuracy on diverse, unseen database environments.

Zero-shot and few-shot NL2SQL refer to methodologies that enable models to generate SQL queries from natural language without, or with minimal, task-specific supervision, especially on unseen database schemas, domains, or query structures. These approaches leverage large pre-trained LLMs (PLMs), LLMs, meta-learning, self-augmentation, execution-guided search, and prompt engineering. They are critical for adapting NL2SQL systems to real-world scenarios where exhaustive annotation is unattainable or where generalization to new environments is required.

1. Zero-Shot and Few-Shot NL2SQL: Definitions and Core Challenges

Zero-shot NL2SQL comprises mapping a natural language input QQ and database schema SS to a SQL query YY, where the model has not been exposed to any annotated (Q, S, SQL) triplets for the target schema or domain. Few-shot NL2SQL permits a small number ktraink \ll |\text{train}| of annotated examples per test schema, leveraging either in-context learning or meta-learning.

Core challenges include:

  • Schema Generalization: Adapting to new tables, columns, and relational structures absent in training data.
  • Compositionality: Composing complex SQL patterns (e.g., joins, nested queries) absent from training supervision.
  • Semantic Parsing Robustness: Disambiguating question intents and aligning them precisely with schema elements, particularly when header names and cell values are ambiguous or unseen.
  • Minimal Data Regime: Achieving high test-time performance with little to no labeled data from the target domain—posing a substantial drop in accuracy for standard supervised models.

On standard benchmarks, the gap between zero/few-shot and fully supervised models remains non-trivial, especially for schemas with little overlap with training data (Pal et al., 2020, Chen et al., 2021, Gu et al., 2023, Pandey et al., 5 Nov 2025, Lee et al., 17 Feb 2025, Li et al., 24 Feb 2025).

2. Fundamental Methodological Paradigms

2.1 Slot-Filling and Sketch-Based Models

Early data-agnostic approaches (e.g., RoBERTa+Bi-LSTM slot models) decompose NL2SQL as a sequence of slot prediction tasks:

  • SELECT \langleaggregate\rangle \langlecolumn\rangle FROM \langletable\rangle [WHERE SS0colSS1 SS2opSS3 SS4valSS5]*
  • Each slot is predicted with a separate submodel, and input features comprise question and header embeddings plus binary knowledge vectors encoding string matches (Pal et al., 2020).

2.2 Content-Aware and Meta-Learning Architectures

Advances include leveraging table content via similarity-driven cell selection for value grounding, and introducing meta-learning (e.g., MAML) so that parameter updates on support sets induce rapid generalization to novel tables. Auxiliary content features are used for WHERE-column and WHERE-value selection, and episodic training splits support and query tables to simulate zero-shot transfer (Chen et al., 2021).

2.3 LLMs and Prompt-Based Inference

LLMs perform text-to-SQL via prompt engineering, without or with very few in-context exemplars. Prompt templates range from:

2.4 Self-Augmentation and Example Filtering

SAFE-SQL introduces a self-augmentation framework: the LLM is prompted to generate SS7 synthetic NL-SQL pairs and corresponding reasoning paths, which are then filtered by semantic similarity SS8, structural alignment SS9, and a reasoning path quality rating YY0. Only examples with composite score above YY1 are retained for in-context learning, removing dependency on external retrieval sets (Lee et al., 17 Feb 2025).

2.5 Execution-Guided Search and Monte Carlo Tree Search (MCTS)

Alpha-SQL frames zero-shot SQL generation as an MCTS problem, with LLMs dynamically proposing SQL construction actions (e.g., schema selection, column value identification, revision). Candidate SQLs are scored by self-supervised execution consistency rewards without requiring gold annotation, mitigating the combinatorial search space present in zero supervision (Li et al., 24 Feb 2025).

2.6 Hybrid PLM–LLM Pipelines

ZeroNL2SQL interleaves a schema-aware, fine-tuned PLM to generate valid “sketches” (SELECT/FROM/KEYWORDS) with an LLM that completes the sketch (WHERE, JOIN, ORDER BY) using black-box prompting and database-aware predicate calibration. Execution-based selection ensures functional correctness (Gu et al., 2023).

3. Prompt Construction and Example Selection Strategies

Prompt construction is central to both zero- and few-shot NL2SQL performance:

Prompt Type Input Components Example Selection
Zero-shot Instruction, schema None or formatting example
Few-shot (retrieval) Instruction, schema, YY2 exemplars Random, similarity, or retriever
Self-augmented Instruction, schema, LLM-generated examples + filter Self-generation + thresholding
Structured Few-shot Canonicalized instruction/schema/response triples Hand-crafted or auto-selected

In few-shot regimes, schema-aware and structured prompts provide substantial gains over naive concatenation. Ordering by relevance and explicit reasoning paths (as chain-of-thought demonstrations) further boost performance for compositional or extra-hard queries (Pandey et al., 5 Nov 2025, Lee et al., 17 Feb 2025, Hu et al., 2022).

4. Performance Analysis and Benchmark Results

Recent benchmarks on Spider, BIRD, and MultiWOZ demonstrate:

  • Data-agnostic RoBERTa-based models reach execution accuracy (Acc_ex) of 74.7% in true zero-shot regimes (unseen schemas), showing only a 1–2 point drop from seen schemas (Pal et al., 2020).
  • MC-SQL, leveraging table content and meta-learning, achieves 89.7% EX on WikiSQL zero-shot dev, +7.7% logical-form gain over baseline SQLova (Chen et al., 2021).
  • SAFE-SQL (self-augmentation) obtains execution accuracy of 87.9% (GPT-4o) on Spider dev, surpassing best few-shot baselines by over 1 pp and naive zero-shot by ≈15 pp (Lee et al., 17 Feb 2025).
  • GEMMA-SQL yields 65% EM/62% TS in zero-shot, with structured and schema-aware few-shot prompting pushing up to 88% EM/86% TS (Pandey et al., 5 Nov 2025).
  • Alpha-SQL reaches 69.7% EX on BIRD dev (Qwen2.5-32B), 2.5 pp above GPT-4o-based zero-shot, and matches best fine-tuned baselines on Spider dev with 87.0% (Li et al., 24 Feb 2025).
  • ZeroNL2SQL achieves a 44.9% execution accuracy on KaggleDBQA zero-shot (vs. 31.9% for PLM-only, 24.3% for vanilla ChatGPT), and 74.9% on Dr.Spider, indicating 10–20 pp gains from combining sketch PLMs, LLMs, and predicate calibration (Gu et al., 2023).

Few-shot gains are typically +10–20 pp for EM/TS when moving from zero-shot to schema/rank-aware prompting or in-context learning.

5. Component Ablations and Limitations

Ablation studies highlight critical design choices:

  • Removing schema linking or reasoning-path demonstrations in SAFE-SQL yields −7.5 and −3.5 pp EX, respectively; not using self-augmentation drops EX by −10.8 pp (Lee et al., 17 Feb 2025).
  • In Alpha-SQL, omitting key MCTS actions (schema selection, SQL revision) results in 0.6–1.8 pp accuracy deficit (Li et al., 24 Feb 2025).
  • Content-aware predictors yield >6 pp LF improvement on zero-shot WikiSQL; omitting meta-learning decreases performance by 1–2 pp (Chen et al., 2021).
  • Data-agnostic, privacy-preserving approaches depend acutely on descriptive header names and fail on ambiguous mappings or multi-table SQL (Pal et al., 2020).

Recognized limitations involve:

6. Extensions, Applications, and Future Directions

Current research highlights several promising directions:

A plausible implication is that, as LLM capabilities increase and example synthesis/filtering becomes more reliable and automated, the distinction between few-shot and zero-shot NL2SQL will further blur—making in-context, prompt-driven generation with execution feedback the default paradigm, especially for emerging database environments and unseen cross-domain table schemas.

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Zero-shot and Few-shot NL2SQL.