SynQL: A Controllable and Scalable Rule-Based Framework for SQL Workload Synthesis for Performance Benchmarking

Published 9 Apr 2026 in cs.DB | (2604.08021v1)

Abstract: Database research and the development of learned query optimisers rely heavily on realistic SQL workloads. Acquiring real-world queries is increasingly difficult, however, due to strict privacy regulations, and publicly released anonymised traces typically strip out executable query text to preserve confidentiality. Existing synthesis tools fail to bridge this training data gap: traditional benchmarks offer too few fixed templates for statistical generalisation, while LLM approaches suffer from schema hallucination fabricating non-existent columns and topological collapse systematically defaulting to simplistic join patterns that fail to stress-test query optimisers. We propose SynQL, a deterministic workload synthesis framework that generates structurally diverse, execution-ready SQL workloads. As a foundational step toward bridging the training-data gap, SynQL targets the core SQL fragment -- multi-table joins with projections, aggregations, and range predicates -- which dominates analytical workloads. SynQL abandons probabilistic text generation in favour of traversing the live database's foreign-key graph to populate an Abstract Syntax Tree (AST), guaranteeing schema and syntactic validity by construction. A configuration vector $Θ$ provides explicit, parametric control over join topology (Star, Chain, Fork), analytical intensity, and predicate selectivity. Experiments on TPC-H and IMDb show that SynQL produces near-maximally diverse workloads (Topological Entropy $H = 1.53$ bits) and that tree-based cost models trained on the synthetic corpus achieve $R² \ge 0.79$ on held-out synthetic test sets with sub-millisecond inference latency, establishing SynQL as an effective foundation for generating training data when production logs are inaccessible.

Abstract PDF Upgrade to Chat

Authors (2)

Summary

The paper introduces SynQL, a deterministic, rule-based framework that generates execution-ready SQL workloads with explicit control over join topologies and predicates.
It demonstrates high topological entropy and near-100% schema validity, outperforming LLM-based methods by avoiding schema hallucination and topological collapse.
Experimental evaluations reveal that SynQL-generated workloads robustly train cost models, achieving excellent predictive power and generalization across diverse query structures.

SynQL: Deterministic and Controllable SQL Workload Synthesis for Learned Database System Evaluation

Motivation and Problem Context

Performance tuning and cost-model training for modern database systems, especially in the context of learned query optimizers, demand large, structurally diverse, and realistic SQL workloads. However, acquiring such datasets is difficult due to privacy, security, and data-governance constraints. Released industry traces (e.g., Snowset, Redset) are anonymized and lack executable SQL content. Existing approaches to synthetic workload generation include manual template benchmarks (TPC-H, TPC-DS, JOB) and, more recently, LLM-based SQL generators. Manual templates are structurally limited, yielding overfitting and poor generalization, while LLM-based synthesis suffers from schema hallucination (invalid table/column references) and topological collapse (monotonous join patterns), with execution success rates under 40% on complex schemas such as those measured in the BIRD and Spider 2.0 evaluations.

SynQL Framework: Architecture and Algorithms

SynQL addresses these limitations via a fully deterministic, rule-based workload synthesis framework for generating execution-ready SQL corpora with explicit, fine-grained control over structural and semantic characteristics. The framework operates in a two-phase pipeline:

Phase I — Topological Traversal: Given the live schema graph (tables as nodes, PK-FK edges as directed arcs), SynQL performs a parametric graph traversal controlled by a topology-bias parameter $\alpha_{\text{shape}}$ . This yields a join subgraph (blueprint) whose structure may be forced towards Star, Chain, or Fork configurations. The process ensures strict schema compliance, since expansions only follow valid PK-FK edges. Sampling is stochastic but reproducible for a fixed random seed, promoting reproducibility in experimental workloads.
Phase II — Semantic Injection and AST Assembly: The join blueprint is populated with projections, aggregations, and valid predicates. Aggregation injection is governed by $P_{\text{agg}}$ , which determines the probability of wrapping numeric columns with SUM or AVG. The WHERE clause appearance and density are controlled by $P_{\text{where}}$ and $K_{\text{pred}}$ , with all literal values sampled from true catalogued domains to ensure type correctness. The resulting abstract syntax tree (AST) is compiled into syntactically valid, executable SQL, enforcing SELECT-list/aggregation/GROUP BY invariants and making syntax errors mathematically impossible.

Workload generation is performed by repeating Phases I and II as an outer loop, with the configuration vector $\Theta = (\alpha_{\text{shape}}, K_{\text{join}}, P_{\text{agg}}, P_{\text{where}}, K_{\text{pred}})$ governing join topology, path length, analytical intensity, and predicate complexity.

Empirical Evaluation: Topological Entropy, Predictive Power, and Generalization

Workload Diversity

On both TPC-H and IMDb schemas, SynQL achieves near-maximal topological entropy— $H = 1.53$ bits (TPC-H; maximum for three classes is $\log_2 3 \approx 1.58$ ) and $H = 1.34$ bits (IMDb)—with balanced representation of Star, Chain, and Fork joins. This directly contradicts the mode collapse seen in LLM-based generation, where over 90% of queries default to hub-and-spoke topologies [lei2024spider2], [li2023bird].

Cost Model Performance

SynQL-generated corpora were used to train tree-based query execution time (QET) predictors using features extracted from PostgreSQL's EXPLAIN (21 features, including plan costs, structural metrics, and operator counts). On held-out synthetic queries, XGBoost and Random Forest achieve $R^2$ of 0.987 (TPC-H) and 0.824 (IMDb), with sub-millisecond inference latency. Importantly, per-topology breakdown demonstrates uniform predictive power: models trained only on single topology classes exhibit severe negative transfer (e.g., $R^2=-1.313$ on Star $P_{\text{agg}}$ 0Chain for IMDb), whereas mixed-topology training yields robust generalization across all classes.

Comparison to LLM-Based Generation

Schema Validity: SynQL achieves 100% validity by construction (zero syntax errors), in contrast to LLMs with failure rates over 60% on enterprise schemas [lei2024spider2], [li2023bird].
Topological Control: The explicit $P_{\text{agg}}$ 1 parameter allows generation of rare and complex join structures (chains, deep forks), vital for stress-testing cost models—functionality absent in black-box LLM techniques.
Experimental Reproducibility: Given fixed $P_{\text{agg}}$ 2 and seed, SynQL emits byte-identical SQL, unlike non-deterministic, prompt-sensitive LLMs.

Limitations and Future Directions

Expressive Fragment Coverage: SynQL currently supports multi-table SELECT with joins, aggregations, and predicates. It does not yet handle nested subqueries, set operations, CTEs, or window functions. However, the AST-centric architecture is modular and extensible.
Engine Portability: While query generation is engine-agnostic, the feature extraction pipeline targets PostgreSQL. Adapting to other DBMSs (Spark SQL, Snowflake, MySQL) requires feature remapping.
Production Deployment and Transfer: The evaluation focuses on models trained/tested on SynQL corpora. Transfer to real-world workloads remains as future work, as with all synthetic-data-based approaches, but SynQL's parametric configuration enables more faithful emulation of production workload distributions.
Diversity Metrics: Current diversity metrics are based on coarse topology classes. Integration of finer-grained join graph metrics could provide deeper coverage analysis.

Implications for Learned Database Systems

SynQL's design and empirical results provide evidence that schema-aware, deterministic workload synthesis is necessary for robust learned QET/cost estimation, especially when privacy barriers preclude the use of real query logs. Explicit topological and predicate controls make SynQL suitable for stress-testing optimizer generalization under rare or structurally diverse query classes, a longstanding challenge inadequately addressed by fixed benchmarks or LLM-based synthesis. Its modular design facilitates extensibility towards richer SQL fragments and other relational engines, and positions it as a platform on which to pre-train robust learned optimizers, even in cross-engine or cross-enterprise settings.

Conclusion

SynQL demonstrates that rule-based, topology-controllable query synthesis resolves fundamental limitations of both fixed-template benchmarks and generative LLM-based SQL generators for learned query optimization research. The framework’s systematic approach yields high topological entropy, excellent schema validity, and supports robust model generalization—directly addressing the training bottleneck in cost-model development for database systems. Future work includes expanding SQL fragment coverage, validating production transfer, enabling plug-and-play engine portability, and refining workload diversity metrics to cover more granular structural dimensions.

Reference:

"SynQL: A Controllable and Scalable Rule-Based Framework for SQL Workload Synthesis for Performance Benchmarking" (2604.08021)

Markdown Report Issue