OurBenchSyn Queries: Enterprise SQL Debugging
- OurBenchSyn Queries is a benchmark suite that tests LLMs by injecting realistic syntax errors into complex enterprise SQL scripts.
- It leverages reverse-engineering of production ETL scripts to generate diverse queries with varying structural complexity and error types.
- The benchmark employs execution-free metrics—exact match, graph match, and AST-edit improvements—to assess repair quality and model progress.
OurBenchSyn Queries represent a new, large-scale benchmark suite for evaluating the ability of automated systems, particularly LLMs, to identify and repair syntax errors in complex enterprise SQL workloads. Developed as part of the "OurBench" framework, OurBenchSyn introduces a unified, data-centric methodology grounded in the reverse engineering of real-world ETL/analytics SQL scripts, explicit injection of realistic faults, and an execution-free evaluation regime tailored to high-complexity, production-grade queries (Ye et al., 26 Jan 2026).
1. Automated Synthesis of Syntax Error Queries
OurBenchSyn is constructed via an end-to-end reverse-engineering workflow that generates realistic, single-error SQL queries from a curated pool of enterprise ETL scripts:
- Seed Curation: Over 1,000 production Hive/Spark SQL scripts are filtered to ensure high structural complexity, scored as a function of DepthAST (the maximum AST depth), WidthAST (the maximum AST fan-out), and Length (the script line count). All seeds are statically validated for clean execution using Calcite-based analysis.
- Taxonomy Development: A three-level taxonomy of bugs is developed from 268 real-world error cases: (i) Macro type (DML, DDL, Semantic, Logic), (ii) Construct (e.g., SELECT list, JOINs, CASE, Parentheses), (iii) Atomic fault (e.g., missing commas, keyword typos).
- Bug Injection: For each seed script:
- Parse to obtain structural profiles (clause counts, AST patterns, nesting).
- Select likely bug types whose structural patterns occur in the script.
- For each selected bug type, pick a random occurrence matching the pattern and apply a transformation (token deletion, substitution, or insertion) at a specific position, yielding the mutated query.
- The resulting buggy query is statically validated to ensure a single explicit syntax error (Calcite/TQS), with easy/unsolvable cases eliminated via "attack-defense" filtering and manual curation.
Coverage: Injected faults span missing/extra punctuation, unmatched parentheses/brackets, keyword typos, clause order errors, list omissions, DML/DDL mismatches, and dialect incompatibilities.
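The atomic fault transformations above can be sketched as follows. This is a minimal illustration: the whitespace tokenization, the position sampling, and the keyword-typo table are simplifications assumed here, whereas the real pipeline matches injection sites against the taxonomy's structural patterns.

```python
import random

# Hypothetical keyword-typo table for the "substitute" fault class.
SUBSTITUTIONS = {"SELECT": "SELCT", "FROM": "FORM", "WHERE": "WHRE"}

def inject_fault(sql: str, kind: str, rng: random.Random) -> str:
    """Apply one atomic fault (delete / substitute / insert) to a SQL string."""
    tokens = sql.split()
    pos = rng.randrange(len(tokens))
    if kind == "delete":
        del tokens[pos]                      # e.g. drop a comma or keyword
    elif kind == "substitute":
        tok = tokens[pos]
        # Keyword typo if known, otherwise a generic corruption.
        tokens[pos] = SUBSTITUTIONS.get(tok.upper(), tok + "X")
    elif kind == "insert":
        tokens.insert(pos, ",")              # spurious punctuation
    return " ".join(tokens)
```

Each call produces exactly one localized mutation, mirroring the benchmark's single-error guarantee before static validation.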
2. Query Scale, Structural Complexity, and Diversity
OurBenchSyn benchmarks are intentionally calibrated for "enterprise scale":
- Size Metrics (on 469 OurBenchSyn tasks):
- Mean script length: 163.7 lines (497 tokens)
- Mean AST depth: 8.93; width: 11.12; median width: 11, range [6,18]
- Mean number of SQL functions/subqueries: 21.62 per script
- Diversity is achieved by:
- Error positions uniformly shuffled among SELECT, FROM, WHERE, GROUP BY, and JOIN clauses.
- Query domains spanning finance, retail, healthcare, social media, with scenario semantics transferred but structure preserved.
- Adversarial selection: queries solved by most LLMs are filtered out, and positions are randomized for retained examples.
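The adversarial selection step can be sketched as a simple retention filter. The solver threshold and the exact-match notion of "solved" are assumptions for illustration; the paper's "attack-defense" filtering also involves manual curation.

```python
def retain_task(model_repairs: list, gold: str, max_solvers: int = 1) -> bool:
    """Keep a task only if at most `max_solvers` probe models already fix it."""
    solved = sum(repair.strip() == gold.strip() for repair in model_repairs)
    return solved <= max_solvers
```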
3. Execution-Free Evaluation Metrics
Enterprise scripts are too heavy for live execution and may admit multiple valid variants after repair. OurBenchSyn employs three static correctness metrics:
- Exact Match (EM): Strict string equality: $\text{EM} = \mathbb{1}[\hat{Q} = Q^{*}]$, where $\hat{Q}$ is the model's repair and $Q^{*}$ the reference.
- Graph Match (GM): Isomorphism of normalized Calcite logical plans: $\text{GM} = \mathbb{1}[\text{Plan}(\hat{Q}) \cong \text{Plan}(Q^{*})]$, capturing semantic equivalence modulo harmless rewrites.
- Modify-Better (MB): Normalized AST-edit-distance improvement: $\text{MB} = \mathbb{1}[d_{\text{AST}}(\hat{Q}, Q^{*}) < d_{\text{AST}}(Q_{\text{bug}}, Q^{*})]$, detecting partial progress when a model's repair gets "closer" to the reference.
These metrics allow nuanced grading of repair quality without executing queries.
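EM and MB can be sketched as below. A token-level Levenshtein distance stands in for the benchmark's normalized AST edit distance (the real metric compares parsed ASTs, not token strings), and GM is omitted because it requires a Calcite planner.

```python
def exact_match(pred: str, gold: str) -> bool:
    """EM: strict string equality after trivial whitespace trimming."""
    return pred.strip() == gold.strip()

def edit_distance(a: list, b: list) -> int:
    """Classic single-row Levenshtein DP over token sequences."""
    dp = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, tb in enumerate(b, 1):
            # prev holds dp[i-1][j-1]; dp[j] still holds dp[i-1][j].
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ta != tb))  # substitution
    return dp[len(b)]

def modify_better(pred: str, buggy: str, gold: str) -> bool:
    """MB proxy: did the repair move strictly closer to the reference?"""
    g = gold.split()
    return edit_distance(pred.split(), g) < edit_distance(buggy.split(), g)
```

This captures the grading intuition: a repair can fail EM yet still register measurable progress under MB.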
4. Representative Examples and Observed Challenge Patterns
Illustrative error classes and model failure modes include:
- Missing Comma in SELECT: In a 120-line, seven-CTE script, the model sees:

```sql
SELECT user_id user_name last_login FROM user_activity;
```

Error: "Encountered 'user_name' at line 3; was expecting one of: ','." Fix: Insert a comma after "user_id".

Challenge: Due to long context, LLMs often hallucinate comma placement or fail to localize the precise line.
- Unmatched Parenthesis in WHERE Clause:

```sql
WITH orders_filtered AS (
    SELECT * FROM orders
    WHERE (status = 'shipped' OR status = 'delivered' ),
shipment_stats AS (...)
```

Challenge: LLMs frequently mistrack bracket closure in deeply nested or widely separated code blocks, triggering error cascades.
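Mechanically, the closure tracking that models mishandle is a simple stack discipline. A minimal checker, written here as an illustration, reports the position of the first unmatched bracket:

```python
def first_unbalanced(sql: str):
    """Return (index, char) of the first unmatched bracket, or None if balanced."""
    pairs = {")": "(", "]": "["}
    stack = []  # pending openers as (index, char)
    for i, ch in enumerate(sql):
        if ch in "([":
            stack.append((i, ch))
        elif ch in pairs:
            if not stack or stack[-1][1] != pairs[ch]:
                return (i, ch)           # stray or mismatched closer
            stack.pop()
    return stack[0] if stack else None   # earliest unclosed opener, if any
```

On the CTE snippet above, such a check flags the opener after `orders_filtered AS` as never closed, which is exactly the localization step that degrades for LLMs as nesting and distance grow.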
Empirical findings indicate:
- Good performance on near-boundary errors, severe degradation as script length and nesting increase.
- Error localization in nested CTEs or window frames is consistently poor.
- Models often improve MB but still fail GM, suggesting limited but nontrivial incremental progress.
5. Performance Benchmarking and Empirical Impact
Across 469 OurBenchSyn queries, nearly 30 LLMs were evaluated. Results reveal a substantial gap between human-level and model ability:
| Model | Exact Match (%) | Graph Match (%) | Modify-Better (%) |
|---|---|---|---|
| Claude-4-Sonnet | 36.46 | 36.5 | – |
| Most other models | <20 | <20 | – |
No LLM approaches even 40% accuracy under strict semantic equivalence. Most struggle on the majority of tasks, underscoring the difficulty of repairing large-scale, realistic SQL scripts with nontrivial, localized bugs (Ye et al., 26 Jan 2026).
6. Distinctions and Applications
OurBenchSyn is the first known corpus to target:
- Enterprise-grade SQL debugging with scripts at production scale, exhibiting large, deep ASTs and broad operator coverage.
- Realistic, taxonomically-guided error synthesis tied to observed fault distributions from production codebases.
- Execution-free, structure-preserving evaluation for both exact and "partial improvement" repairs, eschewing candidate execution or single-reference lock-in.
Applications include:
- Benchmarking and comparative evaluation of LLM-based SQL debuggers, traditional static analyzers, or hybrid repair systems.
- Stress-testing new debugging strategies, with fine-grained metrics distinguishing exact, semantic, and partial repair quality.
- Analyzing repair failure modes in high-complexity or high-diversity SQL settings.
7. Conclusion and Outlook
OurBenchSyn benchmarks fill a previously unaddressed gap in large-scale program debugging, specifically for SQL. Their automated construction, structural diversity, and rigorous execution-free evaluation establish a new standard for future research on code repair, static analysis, and LLM robustness in data engineering. Results to date indicate that even state-of-the-art generative models are far from robust SQL debuggers on enterprise codebases, motivating continued work on scalable, semantics-aware error localization and synthesis for complex database workloads (Ye et al., 26 Jan 2026).