OurBenchSyn Queries: Enterprise SQL Debugging
- OurBenchSyn Queries is a benchmark suite that tests LLMs by injecting realistic syntax errors into complex enterprise SQL scripts.
- It leverages reverse-engineering of production ETL scripts to generate diverse queries with varying structural complexity and error types.
- The benchmark employs execution-free metrics—exact match, graph match, and AST-edit improvements—to assess repair quality and model progress.
OurBenchSyn Queries represent a new, large-scale benchmark suite for evaluating the ability of automated systems, particularly LLMs, to identify and repair syntax errors in complex enterprise SQL workloads. Developed as part of the "OurBench" framework, OurBenchSyn introduces a unified, data-centric methodology grounded in the reverse engineering of real-world ETL/analytics SQL scripts, explicit injection of realistic faults, and an execution-free evaluation regime tailored to high-complexity, production-grade queries (Ye et al., 26 Jan 2026).
1. Automated Synthesis of Syntax Error Queries
OurBenchSyn is constructed via an end-to-end reverse-engineering workflow that generates realistic, single-error SQL queries from a curated pool of enterprise ETL scripts:
- Seed Curation: Over 1,000 production Hive/Spark SQL scripts are filtered to ensure high structural complexity, scored as a function of DepthAST (the maximum AST depth), WidthAST (the maximum AST fan-out), and Length (the script line count). All seeds are statically validated for clean execution using Calcite-based analysis.
- Taxonomy Development: A three-level taxonomy of bugs is developed from 268 real-world error cases: (i) Macro type (DML, DDL, Semantic, Logic), (ii) Construct (e.g., SELECT list, JOINs, CASE, Parentheses), (iii) Atomic fault (e.g., missing commas, keyword typos).
- Bug Injection: For each seed script:
- Parse to obtain structural profiles (clause counts, AST patterns, nesting).
- Select likely bug types whose structural patterns occur in the script.
- For each selected bug type, pick a random occurrence matching the pattern and apply a transformation (token deletion, substitution, or insertion) at a specific position, yielding the mutated query.
- The resulting buggy query is statically validated to ensure a single explicit syntax error (Calcite/TQS), with easy/unsolvable cases eliminated via "attack-defense" filtering and manual curation.
Coverage: Injected faults span missing/extra punctuation, unmatched parentheses/brackets, keyword typos, clause order errors, list omissions, DML/DDL mismatches, and dialect incompatibilities.
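The atomic fault transformations above can be sketched as follows. This is a minimal illustration: the whitespace tokenization, the position sampling, and the keyword-typo table are simplifications assumed here, whereas the real pipeline matches injection sites against the taxonomy's structural patterns.

```python
import random

# Hypothetical keyword-typo table for the "substitute" fault class.
SUBSTITUTIONS = {"SELECT": "SELCT", "FROM": "FORM", "WHERE": "WHRE"}

def inject_fault(sql: str, kind: str, rng: random.Random) -> str:
    """Apply one atomic fault (delete / substitute / insert) to a SQL string."""
    tokens = sql.split()
    pos = rng.randrange(len(tokens))
    if kind == "delete":
        del tokens[pos]                      # e.g. drop a comma or keyword
    elif kind == "substitute":
        tok = tokens[pos]
        # Keyword typo if known, otherwise a generic corruption.
        tokens[pos] = SUBSTITUTIONS.get(tok.upper(), tok + "X")
    elif kind == "insert":
        tokens.insert(pos, ",")              # spurious punctuation
    return " ".join(tokens)
```

Each call produces exactly one localized mutation, mirroring the benchmark's single-error guarantee before static validation.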
2. Query Scale, Structural Complexity, and Diversity
OurBenchSyn benchmarks are intentionally calibrated for "enterprise scale":
- Size Metrics (on 469 OurBenchSyn tasks):
- Mean script length: 163.7 lines (497 tokens)
- Mean AST depth: 8.93; width: 11.12; median width: 11, range [6,18]
- Mean number of SQL functions/subqueries: 21.62 per script
- Diversity is achieved by:
- Error positions uniformly shuffled among SELECT, FROM, WHERE, GROUP BY, and JOIN clauses.
- Query domains spanning finance, retail, healthcare, social media, with scenario semantics transferred but structure preserved.
- Adversarial selection: queries solved by most LLMs are filtered out, and positions are randomized for retained examples.
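The adversarial selection step can be sketched as a simple retention filter. The solver threshold and the exact-match notion of "solved" are assumptions for illustration; the paper's "attack-defense" filtering also involves manual curation.

```python
def retain_task(model_repairs: list, gold: str, max_solvers: int = 1) -> bool:
    """Keep a task only if at most `max_solvers` probe models already fix it."""
    solved = sum(repair.strip() == gold.strip() for repair in model_repairs)
    return solved <= max_solvers
```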
3. Execution-Free Evaluation Metrics
Enterprise scripts are too heavy for live execution and may admit multiple valid variants after repair. OurBenchSyn employs three static correctness metrics:
- Exact Match (EM): Strict string equality: $\text{EM} = \mathbb{1}[\hat{Q} = Q^{*}]$, where $\hat{Q}$ is the model's repair and $Q^{*}$ the reference.
- Graph Match (GM): Isomorphism of normalized Calcite logical plans: $\text{GM} = \mathbb{1}[\text{Plan}(\hat{Q}) \cong \text{Plan}(Q^{*})]$, capturing semantic equivalence modulo harmless rewrites.
- Modify-Better (MB): Normalized AST-edit-distance improvement: $\text{MB} = \mathbb{1}[d_{\text{AST}}(\hat{Q}, Q^{*}) < d_{\text{AST}}(Q_{\text{bug}}, Q^{*})]$, detecting partial progress when a model's repair gets "closer" to the reference.
These metrics allow nuanced grading of repair quality without executing queries.
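EM and MB can be sketched as below. A token-level Levenshtein distance stands in for the benchmark's normalized AST edit distance (the real metric compares parsed ASTs, not token strings), and GM is omitted because it requires a Calcite planner.

```python
def exact_match(pred: str, gold: str) -> bool:
    """EM: strict string equality after trivial whitespace trimming."""
    return pred.strip() == gold.strip()

def edit_distance(a: list, b: list) -> int:
    """Classic single-row Levenshtein DP over token sequences."""
    dp = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, tb in enumerate(b, 1):
            # prev holds dp[i-1][j-1]; dp[j] still holds dp[i-1][j].
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ta != tb))  # substitution
    return dp[len(b)]

def modify_better(pred: str, buggy: str, gold: str) -> bool:
    """MB proxy: did the repair move strictly closer to the reference?"""
    g = gold.split()
    return edit_distance(pred.split(), g) < edit_distance(buggy.split(), g)
```

This captures the grading intuition: a repair can fail EM yet still register measurable progress under MB.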
4. Representative Examples and Observed Challenge Patterns
Illustrative error classes and model failure modes include:
- Missing Comma in SELECT: In a 120-line, seven-CTE script, the model sees:

```sql
SELECT user_id user_name last_login FROM user_activity;
```

Error: "Encountered 'user_name' at line 3; was expecting one of: ','." Fix: Insert a comma after "user_id".

Challenge: Due to long context, LLMs often hallucinate comma placement or fail to localize the precise line.
- Unmatched Parenthesis in WHERE Clause:

```sql
WITH orders_filtered AS (
    SELECT * FROM orders
    WHERE (status = 'shipped' OR status = 'delivered' ),
shipment_stats AS (...)
```

Challenge: LLMs frequently mistrack bracket closure in deeply nested or widely separated code blocks, triggering error cascades.
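Mechanically, the closure tracking that models mishandle is a simple stack discipline. A minimal checker, written here as an illustration, reports the position of the first unmatched bracket:

```python
def first_unbalanced(sql: str):
    """Return (index, char) of the first unmatched bracket, or None if balanced."""
    pairs = {")": "(", "]": "["}
    stack = []  # pending openers as (index, char)
    for i, ch in enumerate(sql):
        if ch in "([":
            stack.append((i, ch))
        elif ch in pairs:
            if not stack or stack[-1][1] != pairs[ch]:
                return (i, ch)           # stray or mismatched closer
            stack.pop()
    return stack[0] if stack else None   # earliest unclosed opener, if any
```

On the CTE snippet above, such a check flags the opener after `orders_filtered AS` as never closed, which is exactly the localization step that degrades for LLMs as nesting and distance grow.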
Empirical findings indicate:
- Good performance on near-boundary errors, severe degradation as script length and nesting increase.
- Error localization in nested CTEs or window frames is consistently poor.
- Models often improve MB but still fail GM, suggesting limited but nontrivial incremental progress.
5. Performance Benchmarking and Empirical Impact
Across 469 OurBenchSyn queries, nearly 30 LLMs were evaluated. Results reveal a substantial gap between human-level and model ability:
| Model | Exact Match (%) | Graph Match (%) | Modify-Better (%) |
|---|---|---|---|
| Claude-4-Sonnet | 36.46 | 36.5 | – |
| Most other models | <20 | <20 | – |
No LLM approaches even 40% accuracy under strict semantic equivalence. Most struggle on the majority of tasks, underscoring the difficulty of repairing large-scale, realistic SQL scripts with nontrivial, localized bugs (Ye et al., 26 Jan 2026).
6. Distinctions and Applications
OurBenchSyn is the first known corpus to target:
- Enterprise-grade SQL debugging with scripts at production scale, exhibiting large, deep ASTs and broad operator coverage.
- Realistic, taxonomically-guided error synthesis tied to observed fault distributions from production codebases.
- Execution-free, structure-preserving evaluation for both exact and "partial improvement" repairs, eschewing candidate execution or single-reference lock-in.
Applications include:
- Benchmarking and comparative evaluation of LLM-based SQL debuggers, traditional static analyzers, or hybrid repair systems.
- Stress-testing new debugging strategies, with fine-grained metrics distinguishing exact, semantic, and partial repair quality.
- Analyzing repair failure modes in high-complexity or high-diversity SQL settings.
7. Conclusion and Outlook
OurBenchSyn benchmarks fill a previously unaddressed gap in large-scale program debugging, specifically for SQL. Their automated construction, structural diversity, and rigorous execution-free evaluation establish a new standard for future research on code repair, static analysis, and LLM robustness in data engineering. Results to date indicate that even state-of-the-art generative models are far from robust SQL debuggers on enterprise codebases, motivating continued work on scalable, semantics-aware error localization and synthesis for complex database workloads (Ye et al., 26 Jan 2026).