SQL-Llama: SQL-Centric Llama Models
- SQL-Llama is a family of Llama-based models and frameworks specialized for SQL completion, natural language to SQL parsing, and SQL-centric reasoning, using supervised fine-tuning and prompt engineering.
- It employs diverse architectures such as direct SQL completion, instruction-tuned pipelines, multi-agent decomposition, retrieval-augmented inference, and contrastive reinforcement learning to enhance performance.
- SQL-Llama systems demonstrate practical industrial adoption and robust error correction through strategies like schema pruning, continual learning, and efficient token management.
SQL-Llama designates a family of Llama-based models and frameworks specialized for SQL completion, natural language to SQL (NL2SQL) parsing, and SQL-centric reasoning, encompassing both academic and industrial systems. These models adapt and extend Llama and Code Llama backbones through supervised fine-tuning, prompt engineering, multi-agent orchestration, continual-learning retrieval, task decomposition, and reinforcement learning. SQL-Llama systems establish state-of-the-art performance and strong efficiency on challenging text-to-SQL benchmarks and at industrial scale, with broad adoption across open-source and proprietary applications.
1. Architectural Families and Core Principles
SQL-Llama encompasses multiple system designs; principal instantiations include:
- Direct SQL completion architectures: Models such as SqlComposeSA and SqlComposeFIM, based on Llama/Code Llama, fine-tuned on large SQL corpora with left-to-right or fill-in-the-middle objectives over millions of verified SQL queries. SqlComposeFIM employs a Language Causal Masking (LCM) objective, which enables bidirectional context utilization (Maddila et al., 2024).
- Instruction-tuned text-to-SQL pipelines: Code Llama 7B and 13B models are full-model fine-tuned or LoRA-adapted using large-scale NL–SQL pairs for instruction following, domain adaptation, and chain-of-thought SQL generation. Few-shot learning and prompt selection are leveraged for schema and context encoding (Wang et al., 2023, Chen et al., 2024, Sun et al., 2023).
- Multi-agent and modular pipelines: MAC-SQL and similar frameworks introduce decomposer–selector–refiner agent abstractions, each instantiated via SQL-Llama, enabling schema pruning, multi-step CoT decomposition, and error-driven correction (Wang et al., 2023).
- Retrieval-augmented continual learners: LPE-SQL augments Llama-3.1 with retrieval over explicit knowledge bases (correct/mistake notebooks), equipping the model with non-parametric, runtime-only continual learning via demonstration selection, cross-consistency voting, and self-updating of the auxiliary KB (Chu et al., 2024).
- Contrastive reward and RL fine-tuned models: Recent work implements Group Relative Policy Optimization (GRPO) with a contrastive semantic reward in LLaMA-3 architectures, optimizing execution and semantic alignment across languages (Kattamuri et al., 10 Oct 2025).
Context-awareness, bidirectional infilling, schema exposure, error feedback, and modular agentic decomposition are central to SQL-Llama’s performance improvements.
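To make the fill-in-the-middle idea concrete, the following is a minimal sketch of FIM prompt assembly using Code Llama-style sentinel tokens. The sentinel strings and helper name are illustrative, not SqlComposeFIM's exact serialization:

```python
# Minimal sketch of fill-in-the-middle (FIM) prompt assembly for SQL
# completion, using Code Llama-style sentinel tokens. The sentinel strings
# and helper names are illustrative, not SqlComposeFIM's exact format.

PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Serialize code before and after the cursor so the model can
    infill the missing middle span with bidirectional context."""
    return f"{PRE}{prefix}{SUF}{suffix}{MID}"

prompt = build_fim_prompt(
    prefix="SELECT user_id, COUNT(*) AS n\nFROM events\n",
    suffix="\nGROUP BY user_id\nORDER BY n DESC;",
)
# The model is expected to emit the missing middle clause,
# e.g. a WHERE filter, conditioned on both prefix and suffix.
```

The key design point is that, unlike prefix-only completion, the suffix after the cursor is visible to the model, which is what drives the multi-line completion gains reported for SqlComposeFIM.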
2. Training Protocols, Fine-Tuning, and Data Curation
SQL-Llama systems are universally derived from pre-trained Llama or Code Llama checkpoints (7B–13B parameters dominate for cost/performance). Core training protocols are as follows:
- Supervised fine-tuning: Cross-entropy minimized on gold SQL tokens given prompt–completion pairs, typically with extended schema and database context included in the prompt (Maddila et al., 2024, Chen et al., 2024, Rebei, 2023). LoRA (low-rank) adapters are often used for efficient adaptation, supporting single-GPU training.
- Data pyramid/curation procedures:
- Massive code and NL corpora for pretraining
- Production SQL artifacts: up to 10M verified, runtime-safe queries with schema expansion
- Heuristic de-duplication, execution validation, and complexity-tier balancing (inspired by Spider/BIRD splits)
- Dialect and domain adaptation (Snowflake, GoogleSQL, retail analytics, etc.)
- Synthetic data augmentation (paraphrasing, self-healing loops for ill-formed queries)
- Chain-of-Thought and task decomposition: Prompt templates interleave schema linking, skeleton-based SQL infilling, and step-wise question breakdown for improved logical correctness (Chen et al., 2024, Wang et al., 2023).
- Reinforcement learning and contrastive tuning: GRPO-based RL is directly employed with semantic and execution reward in (Kattamuri et al., 10 Oct 2025).
Prompt length, schema chunking, and example selection are managed to fit within model token limits (2K–8K tokens; code-specialized models support longer contexts). Table/column truncation schemes selectively retain high-salience schema elements.
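The salience-based truncation mentioned above can be sketched as a greedy selection under a token budget. The salience scores and the crude per-entry token estimate are stand-ins; production systems use learned selectors and the model's real tokenizer:

```python
# Sketch of salience-based schema truncation under a token budget.
# Salience scores and the per-entry token cost are illustrative
# assumptions, not a specific system's implementation.

def truncate_schema(tables: dict, salience: dict, budget: int) -> list:
    """Keep the highest-salience 'table.column' entries until the
    token budget is exhausted."""
    cols = [(f"{t}.{c}", salience.get(f"{t}.{c}", 0.0))
            for t, cs in tables.items() for c in cs]
    cols.sort(key=lambda x: x[1], reverse=True)
    kept, used = [], 0
    for name, _ in cols:
        cost = 2  # crude fixed token estimate per schema entry
        if used + cost > budget:
            break
        kept.append(name)
        used += cost
    return kept

schema = {"orders": ["id", "user_id", "total"], "users": ["id", "email"]}
scores = {"orders.total": 0.9, "orders.user_id": 0.8, "users.id": 0.7}
print(truncate_schema(schema, scores, budget=6))
# → ['orders.total', 'orders.user_id', 'users.id']
```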
3. Inference Workflows, Agentic Orchestration, and Error Correction
SQL-Llama-based systems are typically executed as follows:
- Prompt assembly: User NL question, schema (often auto-pruned), and optional external knowledge are serialized into a hierarchical or OpenPrompt-style input.
- Agentic execution:
- Selector agent: Given resource constraints or large schemas, invokes schema-pruning (keep/drop columns/tables) via instruction-tuned Llama models.
- Decomposer agent: Employs in-context chain-of-thought, breaks complex queries into subproblems, and emits intermediate/final SQL.
- Refiner agent: Catches syntax/runtime errors or empty result sets. On failure, error messages and the original query are re-prompted to the LLM for self-correction in constrained loops (Wang et al., 2023).
- External tools: MAC-SQL and similar frameworks use tool calls (e.g., schema lookup, execution, visualization) in a ReAct/plan–act–observe pattern for tool-augmented reasoning (Redd et al., 29 Oct 2025).
- Retrieval-augmented inference: LPE-SQL injects top-k exemplars from correct/mistake notebooks into the prompt, with cross-consistency voting across demonstration mixes, and automatic notebook updates based on execution parity with ground truth (Chu et al., 2024).
- Continual learning: Log correct/incorrect results with associated reasoning or error correction for future in-context retrieval (no parameter update required).
Failover strategies combine majority voting, schema-masked reranking, and dynamic querying for maximal robustness.
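The refiner agent's error-driven correction loop can be sketched with SQLite and a stubbed model call. The `llm_fix` callback stands in for the actual re-prompting of the LLM with the error message; the retry budget and empty-result trigger follow the description above:

```python
import sqlite3

# Sketch of the refiner agent's constrained self-correction loop: execute
# the candidate SQL, and on a syntax/runtime error or empty result set,
# re-prompt the model (stubbed here as `llm_fix`) with the error message,
# up to a fixed retry budget.

def refine(conn, sql, llm_fix, max_rounds=3):
    for _ in range(max_rounds):
        try:
            rows = conn.execute(sql).fetchall()
            if rows:                      # empty result sets also trigger repair
                return sql, rows
            error = "query returned no rows"
        except sqlite3.Error as exc:
            error = str(exc)
        sql = llm_fix(sql, error)         # self-correction via re-prompting
    return sql, []

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(x INT)")
conn.execute("INSERT INTO t VALUES (1), (2)")

# Stub "model" that repairs a misspelled table name on the first retry.
fixed, rows = refine(conn, "SELECT x FROM tt", lambda s, e: s.replace("tt", "t"))
print(fixed, rows)  # SELECT x FROM t [(1,), (2,)]
```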
4. Evaluation Metrics, Empirical Results, and Cost Analysis
SQL-Llama performance is benchmarked on canonical datasets (Spider, BIRD, MultiSpider, classical sets) with metrics such as:
- Execution Accuracy (ExecAcc): equivalence of predicted and gold result sets when both queries are executed against the database
- Exact Match (EM): String-level SQL match
- BLEU / Clause Containment Score (CS) / Table-Match Score (TMS)
- Semantic Accuracy (SemAcc): Verified equivalence across database states (Kattamuri et al., 10 Oct 2025)
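A minimal execution-accuracy checker can be sketched as order-insensitive result-set comparison on a shared database. Real harnesses (e.g., the Spider/BIRD evaluators) handle additional details such as ORDER BY sensitivity and value normalization; this is a simplified illustration:

```python
import sqlite3

# Sketch of execution accuracy: a prediction is correct when its sorted
# result set matches the gold query's on the same database. Unexecutable
# predictions score zero. Real benchmark harnesses add further details.

def exec_match(conn, pred_sql, gold_sql):
    try:
        pred = sorted(conn.execute(pred_sql).fetchall())
    except sqlite3.Error:
        return False
    gold = sorted(conn.execute(gold_sql).fetchall())
    return pred == gold

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales(region TEXT, amt INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10), ("west", 5), ("east", 7)])

print(exec_match(
    conn,
    "SELECT region, SUM(amt) FROM sales GROUP BY region",
    "SELECT region, SUM(amt) FROM sales GROUP BY region ORDER BY region",
))
# → True: identical rows; this checker ignores ordering
```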
Key results include:
- SqlComposeFIM (Meta, 7B): +35 pp BLEU, +20 pp EM, +13 pp TMS over prefix-only models on multi-line completion; achieves 75% table-match on multi-line queries (Maddila et al., 2024).
- Code-Llama+LoRA (13B): 81.58% (Snowflake) / 82.66% (GoogleSQL) accuracy on domain-specific retail NL2SQL (vs. GPT-4’s 45–48%) in zero-shot, with sub-10s latency (Rebei, 2023).
- MAC-SQL + SQL-Llama(7B): 43.94% execution accuracy on BIRD test, closely approaching GPT-4’s 46.35% (Wang et al., 2023).
- OpenPrompt + SFTI-CoT-SK-FULL with Code-Llama-7B: 48.24% BIRD-dev, exceeding GPT-4’s 46.35%; schema chunking and token-efficient selection critical for large input spaces (Chen et al., 2024).
- Contrastive RL (LLaMA-3-3B, MultiSpider): 88.9% ExecAcc / 59.1% SemAcc (multilingual), outperforming a zero-shot LLaMA-8B model despite smaller parameter count (Kattamuri et al., 10 Oct 2025).
- Efficiency: QLoRA-quantized (4-bit) pipelines with Llama2-7B + CodeLlama-13B achieve ~47% execution accuracy at 1/135th parameter count, 90× speedup, >100× lower cost relative to GPT-4 on BIRD-SQL (Domínguez et al., 2024).
- Resource usage: Despite these improvements, even quantized 7B models require many times more energy per query than a native SQLite engine, capping practicality for large-scale querying (Zhang et al., 2024).
5. Practical Deployment and Industry Adoption
SQL-Llama models are widely deployed in both research and production contexts:
- Meta’s SqlCompose: Inline suggestions in Daiquery notebooks for Presto/Spark SQL, serving 16K+ developers, 4.5M suggestions/quarter with 21% acceptance, <200 ms latency (Maddila et al., 2024).
- Feedback and adaptation: User acceptance, character-per-opportunity, retention, opt-out, and qualitative feedback systematically looped into schema-adaptation and user experience optimization.
- Token-oriented optimizations: Column/table selection and truncation, data chunking, variable-length schema encoding, and prompt adaptation enable use of small models in large, multi-table analytical settings (Chen et al., 2024, DomÃnguez et al., 2024).
- Portability: SQL-Llama inference graphs can be compiled into SQL and executed on any standard RDBMS (Postgres, SQLite, DuckDB), enabling LLM serving independent of specialized hardware or ML infrastructure (Sun et al., 5 Feb 2025).
Model architectures and context windows are regularly adapted to accommodate the evolving data and query complexity present in enterprise environments.
6. Limitations and Ongoing Challenges
Outstanding limitations and areas for further refinement include:
- Long-context and schema generalization: Context truncation remains a bottleneck for very large databases; token-efficient schemes and chunking are partial but not complete remedies (Chen et al., 2024).
- Semantic and dialect generalization: Fine-tuned models may overfit their training schemas and SQL dialects, limiting out-of-distribution robustness. Generating window functions or handling advanced analytics is out-of-scope for narrow-domain tuned models (Rebei, 2023).
- Residual hallucinations: Table/column hallucination rates persist (down to ~25% error in production systems for multi-line queries), though mitigated by schema-exposure and infill objectives (Maddila et al., 2024).
- Resource utilization: Even highly optimized Llama models incur significant energy and inference overhead per query compared to native SQL, imposing practical and environmental constraints (Zhang et al., 2024).
- Human parity: On BIRD, best Llama-based approaches remain ≥20 pp below human performance (93%), and 10–30 pp below closed, larger LLMs (Chu et al., 2024).
Advances using GRPO with contrastive rewards and retrieval-augmented prompt design suggest promising directions for future gains in multilingual and cross-domain reasoning (Kattamuri et al., 10 Oct 2025, Chu et al., 2024). Incorporating formal semantic rewards, external symbolic parsers (e.g., PICARD), and dynamic schema retrieval are active areas of extension.
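The GRPO-with-contrastive-reward direction can be sketched as follows: each sampled SQL candidate receives a reward combining execution match with a crude semantic-overlap term, and rewards are then normalized within the sampling group to form advantages. The weights and the Jaccard-overlap proxy are illustrative assumptions, not the cited work's exact reward:

```python
# Sketch of a GRPO-style group-relative advantage over a contrastive
# reward: execution match plus a token-overlap "semantic" term, normalized
# within the sampled group. Weights and the overlap proxy are assumptions.

def reward(pred, gold, exec_ok, w_exec=0.7, w_sem=0.3):
    p, g = set(pred.lower().split()), set(gold.lower().split())
    sem = len(p & g) / len(p | g) if p | g else 0.0  # Jaccard overlap proxy
    return w_exec * float(exec_ok) + w_sem * sem

def group_advantages(rewards):
    """GRPO normalizes each reward against its sampling group's mean/std,
    avoiding a separate learned value function."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

gold = "SELECT name FROM users WHERE age > 30"
samples = [("SELECT name FROM users WHERE age > 30", True),
           ("SELECT * FROM users", False),
           ("SELECT name FROM user WHERE age > 30", False)]
rs = [reward(s, gold, ok) for s, ok in samples]
print(group_advantages(rs))  # the executing, exact match gets the largest advantage
```

Group-relative normalization means near-miss candidates (right tokens, wrong execution) still receive graded credit relative to clearly wrong ones, which is the contrastive signal the reward is designed to exploit.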
7. Summary Table: Major SQL-Llama Variants and Empirical Accuracies
| Model/Framework | Architecture / Tuning | Context/Domain | Benchmark | ExecAcc (%) | Special Features |
|---|---|---|---|---|---|
| SqlComposeFIM | Llama2-7B, FIM+FT | Meta SQL prod | SeqComp | 20–50 | Bi-directional infilling, 10M queries |
| Code-Llama+LoRA-13B | LoRA, retail+synthetic | Snowflake/GoogleSQL | Retail test | 81.5–82.7 | CTE style, modular SQL, 5s latency |
| SQL-Llama (MAC-SQL) | Code-Llama-7B, full FT | General/Spider/BIRD | BIRD dev | 43.9 | Multi-agent: Selector, Decomposer, Refiner |
| Open-SQL + Code-Llama | LoRA + SFTI-CoT-SK-FULL | BIRD (open-text) | BIRD dev | 48.2 | Token-efficient schema, CoT, 7B model |
| SQL-Llama (LPE-SQL) | Llama3.1-70B, retrieval | General/BIRD | BIRD dev | 66.4 | Continual KB, cross-consistency voting |
| LLaMA-3-3B (GRPO) | RL + contrastive reward | MultiSpider, 7 langs | MultiSpider | 88.9 | Semantic/exec reward, LoRA adapters |
Further details should be consulted in the respective references for implementation specifics, evaluation splits, and extended results (Maddila et al., 2024, Wang et al., 2023, Chen et al., 2024, Rebei, 2023, Chu et al., 2024, Kattamuri et al., 10 Oct 2025, Domínguez et al., 2024, Zhang et al., 2024, Redd et al., 29 Oct 2025, Sun et al., 2023, Sun et al., 5 Feb 2025).