
JOLT-SQL: Unified Single-Stage Text-to-SQL

  • JOLT-SQL is a supervised fine-tuning framework that integrates schema linking and SQL generation into a single-stage training process, enhancing both robustness and efficiency.
  • It utilizes innovations like marker-token classification, local bidirectional attention, and confusion-aware noisy schema sampling to effectively manage schema noise.
  • Empirical results demonstrate state-of-the-art execution accuracy and improved training/inference speeds on benchmarks such as Spider and BIRD.

JOLT-SQL is a supervised fine-tuning (SFT) framework for mapping natural language utterances to executable SQL queries. It addresses the limitations of existing SFT approaches, which rely on complex multi-stage pipelines and suffer degraded robustness in the presence of schema noise. By integrating schema linking and SQL generation into a single-stage, jointly optimized training process, JOLT-SQL improves training and inference efficiency as well as accuracy under real-world conditions where database schema information may contain irrelevant or noisy elements. Its core innovations are discriminative schema linking via marker-token classification, local bidirectional attention (LBA), schema-selective attention for SQL generation, and a confusion-aware noisy schema sampling (NSS) strategy. JOLT-SQL achieves state-of-the-art execution accuracy on established benchmarks among open-source LLMs of comparable size (Song et al., 20 May 2025).

1. Unified Single-Stage Fine-Tuning Framework

JOLT-SQL replaces traditional multi-stage pipelines—in which models are typically fine-tuned separately for schema linking and SQL generation—with a joint, single-stage training regime. Each input instance during training is represented as a token sequence:

  • $X = \langle\text{Prefix}\rangle \parallel \langle\text{Schema}\rangle \parallel \langle\text{Query}\rangle$
    • $\langle\text{Prefix}\rangle$ encodes user intent and task instruction.
    • $\langle\text{Schema}\rangle$ encodes the database schema, with a special marker token inserted after each column definition.
    • $\langle\text{Query}\rangle$ contains the reference SQL, used exclusively during SFT; a serialization sketch follows this list.
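A minimal sketch of this input layout, assuming a plain-text serialization; the `MARKER_TOKEN` string, the `build_input` helper, and the `table.column` formatting are illustrative assumptions rather than the paper's exact format:

```python
MARKER_TOKEN = "<col>"  # hypothetical special token appended after each column

def build_input(prefix: str, schema: list[tuple[str, list[str]]], gold_sql: str) -> str:
    """Serialize <Prefix> || <Schema> || <Query> into one training string."""
    parts = []
    for table, columns in schema:
        for col in columns:
            # one marker per column definition; its hidden state is later
            # classified as relevant/irrelevant during schema linking
            parts.append(f"{table}.{col} {MARKER_TOKEN}")
    return f"{prefix}\n{' '.join(parts)}\n{gold_sql}"

example = build_input(
    prefix="-- Question: How many singers are there?",
    schema=[("singer", ["singer_id", "name", "age"])],
    gold_sql="SELECT COUNT(*) FROM singer;",
)
```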

The model simultaneously:

  • Performs discriminative classification for schema linking at each marker token.
  • Predicts the next SQL token, conditioned on a dynamically constructed, schema-selective attention mask.

The total loss for each training step is a straightforward sum:

$L_\text{total} = L_\text{SL} + L_\text{NTP}$

No additional weighting or task scaling is applied; both losses contribute equally (Song et al., 20 May 2025).

2. Discriminative Schema Linking and Attention Design

Schema linking is implemented as a marker-token classification problem. For a token sequence of length $n$ with last-layer hidden states $H \in \mathbb{R}^{n \times d}$, the model computes marker relevance as:

$\hat{y}_i = \sigma(W \cdot h_i)$

where $W \in \mathbb{R}^{1 \times d}$ is a trained projection and $\sigma$ is the sigmoid function. The ground truth is $y_i = 1$ for marker tokens corresponding to columns referenced in the gold SQL, and $y_i = 0$ otherwise. Marker positions are identified by a binary mask $m_i$.
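A minimal PyTorch sketch of such a classification head; the module name, the bias-free projection, and the batch layout are assumptions:

```python
import torch
import torch.nn as nn

class SchemaLinkHead(nn.Module):
    """Linear probe over last-layer hidden states; W has shape (1, d)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, 1, bias=False)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model) -> relevance probability per token
        return torch.sigmoid(self.proj(hidden)).squeeze(-1)
```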

The schema linking loss is the average binary cross-entropy over marker positions:

$$L_\text{SL} = -\frac{1}{\sum_i m_i} \sum_{i=1}^{n} m_i \left[\,y_i \log \hat{y}_i + (1-y_i) \log (1-\hat{y}_i)\,\right]$$
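A direct translation of this loss, sketched in PyTorch; tensor names and the per-sequence shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def schema_linking_loss(y_hat: torch.Tensor,       # (seq_len,) predicted probabilities
                        y: torch.Tensor,           # (seq_len,) gold 0/1 labels
                        marker_mask: torch.Tensor  # (seq_len,) 1.0 at marker positions
                        ) -> torch.Tensor:
    """Binary cross-entropy averaged over marker positions only."""
    bce = F.binary_cross_entropy(y_hat, y.float(), reduction="none")
    return (bce * marker_mask).sum() / marker_mask.sum()
```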

To overcome limitations of standard decoder-only architectures—where schema tokens lack awareness of global schema context—JOLT-SQL introduces Local Bidirectional Attention (LBA). Schema tokens attend bidirectionally to all columns and the prefix, with marker tokens excluded except when self-referencing. Non-schema tokens retain the native causal attention regime. This enables each column definition to contextualize its relevance decision within the schema’s overall structure (Song et al., 20 May 2025).
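One reading of the LBA rules as a boolean attention mask (True = may attend), sketched below; how marker tokens interact with non-schema positions, and the index conventions, are assumptions for illustration:

```python
import torch

def lba_mask(n: int,
             prefix_idx: torch.Tensor,  # indices of prefix tokens
             schema_idx: torch.Tensor,  # indices of schema tokens (incl. markers)
             marker_idx: torch.Tensor   # indices of marker tokens
             ) -> torch.Tensor:
    """Build an n x n boolean attention mask, causal by default."""
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))
    # schema tokens attend bidirectionally to the prefix and all schema tokens
    mask[schema_idx[:, None], prefix_idx[None, :]] = True
    mask[schema_idx[:, None], schema_idx[None, :]] = True
    # marker tokens are excluded as attention keys, except for self-reference
    mask[:, marker_idx] = False
    mask[marker_idx, marker_idx] = True
    return mask
```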

3. Schema-Selective Attention for SQL Generation

During SQL generation, JOLT-SQL employs schema-selective attention. At inference, the decoder is exposed only to schema items classified as relevant by their marker probabilities (threshold $\hat{y}_i > 0.05$, chosen for high recall). Training attention masks mimic this inference condition: non-relevant schema items are suppressed, and a sampled set of noisy schema items is injected to regularize training and enhance robustness.

Formally, let $I_\text{prefix}$, $I_\text{GT}$, $I_\text{noisy}$, $I_\text{qry}$, and $I_\text{marker}$ denote the indices of prefix tokens, gold-standard schema items, injected noise, SQL tokens, and marker tokens, respectively. Attention for SQL positions is defined as:

$$A(x_i) = \left(I_\text{prefix} \cup I_\text{GT} \cup I_\text{noisy} \cup \{\,j \in I_\text{qry} \mid j \leq i\,\}\right) \setminus I_\text{marker}$$

Autoregressive next-token prediction loss is applied:

$$L_\text{NTP} = -\frac{1}{m} \sum_{i=n-m+1}^{n} \log P\left(x_i \mid \text{context};\, A(x_i)\right)$$

where $m$ is the length of the SQL sequence (Song et al., 20 May 2025).
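A sketch of both pieces, assuming the index sets are held as Python sets and that logits at the SQL positions have already been gathered; the names `sql_attention_set` and `ntp_loss` are illustrative:

```python
import torch
import torch.nn.functional as F

def sql_attention_set(i: int, prefix_idx: set[int], gt_idx: set[int],
                      noisy_idx: set[int], qry_idx: set[int],
                      marker_idx: set[int]) -> set[int]:
    """Positions a SQL token at index i may attend to, per A(x_i)."""
    allowed = prefix_idx | gt_idx | noisy_idx | {j for j in qry_idx if j <= i}
    return allowed - marker_idx

def ntp_loss(sql_logits: torch.Tensor,  # (m, vocab) logits at SQL positions
             sql_targets: torch.Tensor  # (m,) gold next-token ids
             ) -> torch.Tensor:
    """Average autoregressive next-token loss over the m SQL tokens."""
    return F.cross_entropy(sql_logits, sql_targets)
```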

4. Confusion-Aware Noisy Schema Sampling (NSS)

To address real-world inference noise, a confusion-aware NSS algorithm samples schema distractors during training. For each example:

  • Identify the noise pool $S_\text{cand} = S_\text{schema} \setminus S_\text{GT}$.
  • Compute confusion scores $\hat{y}_j$ with a gradient-free forward pass for all $j \in S_\text{cand}$.
  • Draw $k = \lfloor u \cdot |S_\text{schema}| \rfloor$ noise items, where $u \sim \text{Uniform}(0, \beta)$ and $\beta$ is task-specific ($0.2$ for Spider, $0.1$ for BIRD).
  • Sample indices proportionally to confusion scores: $S_\text{noisy} \leftarrow \text{Sample}(S_\text{cand}, \text{weights}=\hat{y}, \text{count}=k)$.

Confusion scores are cached after the first epoch to minimize training overhead; the sampled noisy schema items are then included in the dynamic attention mask for SQL prediction in subsequent training passes (Song et al., 20 May 2025).
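A direct translation of the sampling step, as a sketch; the function and argument names are illustrative, with $\beta = 0.2$ for Spider and $0.1$ for BIRD as stated above:

```python
import torch

def sample_noisy_schema(cand_scores: torch.Tensor,  # cached confusion scores over S_cand
                        n_schema: int,               # |S_schema|
                        beta: float                  # 0.2 (Spider) / 0.1 (BIRD)
                        ) -> torch.Tensor:
    """Return indices into S_cand of the sampled noisy schema items."""
    u = torch.rand(()).item() * beta                 # u ~ Uniform(0, beta)
    k = min(int(u * n_schema), cand_scores.numel())  # k = floor(u * |S_schema|)
    if k == 0:
        return torch.empty(0, dtype=torch.long)
    # weighted sampling without replacement, proportional to confusion
    return torch.multinomial(cand_scores, k, replacement=False)
```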

5. End-to-End Training and Inference

The model is initialized and trained end-to-end in a single joint loop:

```
initialize model
for epoch in 1..E:
  for each batch:
    forward(X) → H, compute L_SL (§2)
    if epoch == 1 and no cache: store ŷ for S_cand
    sample S_noisy using cached ŷ and β (§4)
    rebuild attention mask A(·) (§2–§3)
    forward(X with mask) → next-token logits, compute L_NTP (§3)
    L = L_SL + L_NTP
    backward(L), update parameters
```
At inference, relevant schema markers are selected via $\hat{y}_i > 0.05$, all other schema items are pruned, and standard causal decoding generates the SQL (Song et al., 20 May 2025).
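The recall-oriented pruning step amounts to a simple threshold over marker probabilities; the function name below is illustrative:

```python
import torch

def select_schema_items(marker_probs: torch.Tensor,
                        threshold: float = 0.05) -> torch.Tensor:
    """Indices of schema items kept for SQL decoding (recall-oriented cutoff)."""
    return (marker_probs > threshold).nonzero(as_tuple=True)[0]
```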

6. Empirical Evaluation and Ablation

Extensive experiments compare JOLT-SQL to prior methods (including DTS-SQL and standard fine-tuning) on the Spider and BIRD benchmarks using Qwen2.5-Coder-7B/14B models. Notable findings:

Execution accuracy (EX):

  • JOLT-SQL (14B): 88.4% (Spider Dev), 88.9% (Spider Test), 64.9% (BIRD Dev)

Schema linking:

  • DTS-SQL (generative, 7B): P = 93.7%, R = 94.2%
  • JOLT-SQL (7B): P = 88.1%, R = 98.1%, ROC-AUC = 99.9%, PR-AUC = 98.7%

Ablations of JOLT-SQL (7B), reported as execution accuracy on Spider Dev / BIRD Dev when individual components (LBA, NSS, selective attention) or all of them are removed:

  • –LBA: 84.8% / 58.3%
  • –NSS: 86.1% / 58.6%
  • –SelectiveAttention: 85.9% / 59.2%
  • –All: 84.5% / 57.7%

Efficiency metrics (Spider, 3 epochs, on an A30 GPU):

  • Training: Standard SFT 4h38m; DTS-SQL 7h10m (+53%); JOLT-SQL 5h5m (+8.5%)
  • Inference: Standard SFT 0.94s/ex; DTS-SQL 1.34s/ex; JOLT-SQL 0.88s/ex (schema linking: 0.11s; SQL decode: 0.77s)

These results establish state-of-the-art execution accuracy and strong efficiency for JOLT-SQL among open-source LLMs of comparable scale (Song et al., 20 May 2025).

7. Implementation Summary and Design Principles

Key implementation directives (numeric defaults are collected in the sketch after this list):

  • Use a unified training loop with an equal-weight joint loss ($L_\text{SL} + L_\text{NTP}$).
  • Insert marker tokens after each schema column; employ marker classification for schema selection.
  • Employ local bidirectional attention among schema tokens and prefix.
  • During SQL prediction, mask non-gold schema and inject noise sampled via confusion-aware NSS; cache confusion scores after the first epoch for efficiency.
  • Employ recall-oriented thresholding for schema marker selection in inference (Song et al., 20 May 2025).
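For convenience, the numeric settings cited in this article can be gathered in one place; the dataclass below is an organizational sketch, not an artifact of the paper:

```python
from dataclasses import dataclass

@dataclass
class JoltSqlConfig:
    marker_threshold: float = 0.05  # recall-oriented linking cutoff at inference
    nss_beta_spider: float = 0.2    # upper bound on sampled noise fraction (Spider)
    nss_beta_bird: float = 0.1      # upper bound on sampled noise fraction (BIRD)
    cache_confusion_scores: bool = True  # reuse first-epoch scores for NSS
```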

Together, these algorithms, loss functions, and attention mechanisms specify JOLT-SQL fully enough to reproduce or extend the framework for robust text-to-SQL translation.
