- The paper introduces SynQL, a deterministic, rule-based framework that generates execution-ready SQL workloads with explicit control over join topologies and predicates.
- It demonstrates high topological entropy and near-100% schema validity, outperforming LLM-based methods by avoiding schema hallucination and topological collapse.
- Experimental evaluations reveal that SynQL-generated workloads robustly train cost models, achieving excellent predictive power and generalization across diverse query structures.
SynQL: Deterministic and Controllable SQL Workload Synthesis for Learned Database System Evaluation
Motivation and Problem Context
Performance tuning and cost-model training for modern database systems, especially in the context of learned query optimizers, demand large, structurally diverse, and realistic SQL workloads. However, acquiring such datasets is difficult due to privacy, security, and data-governance constraints. Released industry traces (e.g., Snowset, Redset) are anonymized and lack executable SQL content. Existing approaches to synthetic workload generation include manual template benchmarks (TPC-H, TPC-DS, JOB) and, more recently, LLM-based SQL generators. Manual templates are structurally limited, yielding overfitting and poor generalization, while LLM-based synthesis suffers from schema hallucination (invalid table/column references) and topological collapse (monotonous join patterns), with execution success rates under 40% on complex schemas such as those measured in the BIRD and Spider 2.0 evaluations.
SynQL Framework: Architecture and Algorithms
SynQL addresses these limitations via a fully deterministic, rule-based workload synthesis framework for generating execution-ready SQL corpora with explicit, fine-grained control over structural and semantic characteristics. The framework operates in a two-phase pipeline:
- Phase I — Topological Traversal: Given the live schema graph (tables as nodes, PK-FK edges as directed arcs), SynQL performs a parametric graph traversal controlled by a topology-bias parameter αshape. This yields a join subgraph (blueprint) whose structure may be forced towards Star, Chain, or Fork configurations. The process ensures strict schema compliance, since expansions only follow valid PK-FK edges. Sampling is stochastic but reproducible for a fixed random seed, promoting reproducibility in experimental workloads.
- Phase II — Semantic Injection and AST Assembly: The join blueprint is populated with projections, aggregations, and valid predicates. Aggregation injection is governed by Pagg, which determines the probability of wrapping numeric columns with SUM or AVG. The WHERE clause appearance and density are controlled by Pwhere and Kpred, with all literal values sampled from true catalogued domains to ensure type correctness. The resulting abstract syntax tree (AST) is compiled into syntactically valid, executable SQL, enforcing SELECT-list/aggregation/GROUP BY invariants and making syntax errors mathematically impossible.
Workload generation is performed by repeating Phases I and II as an outer loop, with the configuration vector Θ=(αshape,Kjoin,Pagg,Pwhere,Kpred) governing join topology, path length, analytical intensity, and predicate complexity.
Empirical Evaluation: Topological Entropy, Predictive Power, and Generalization
Workload Diversity
On both TPC-H and IMDb schemas, SynQL achieves near-maximal topological entropy—H=1.53 bits (TPC-H; maximum for three classes is log23≈1.58) and H=1.34 bits (IMDb)—with balanced representation of Star, Chain, and Fork joins. This directly contradicts the mode collapse seen in LLM-based generation, where over 90% of queries default to hub-and-spoke topologies [lei2024spider2], [li2023bird].
SynQL-generated corpora were used to train tree-based query execution time (QET) predictors using features extracted from PostgreSQL's EXPLAIN (21 features, including plan costs, structural metrics, and operator counts). On held-out synthetic queries, XGBoost and Random Forest achieve R2 of 0.987 (TPC-H) and 0.824 (IMDb), with sub-millisecond inference latency. Importantly, per-topology breakdown demonstrates uniform predictive power: models trained only on single topology classes exhibit severe negative transfer (e.g., R2=−1.313 on StarPagg0Chain for IMDb), whereas mixed-topology training yields robust generalization across all classes.
Comparison to LLM-Based Generation
- Schema Validity: SynQL achieves 100% validity by construction (zero syntax errors), in contrast to LLMs with failure rates over 60% on enterprise schemas [lei2024spider2], [li2023bird].
- Topological Control: The explicit Pagg1 parameter allows generation of rare and complex join structures (chains, deep forks), vital for stress-testing cost models—functionality absent in black-box LLM techniques.
- Experimental Reproducibility: Given fixed Pagg2 and seed, SynQL emits byte-identical SQL, unlike non-deterministic, prompt-sensitive LLMs.
Limitations and Future Directions
- Expressive Fragment Coverage: SynQL currently supports multi-table SELECT with joins, aggregations, and predicates. It does not yet handle nested subqueries, set operations, CTEs, or window functions. However, the AST-centric architecture is modular and extensible.
- Engine Portability: While query generation is engine-agnostic, the feature extraction pipeline targets PostgreSQL. Adapting to other DBMSs (Spark SQL, Snowflake, MySQL) requires feature remapping.
- Production Deployment and Transfer: The evaluation focuses on models trained/tested on SynQL corpora. Transfer to real-world workloads remains as future work, as with all synthetic-data-based approaches, but SynQL's parametric configuration enables more faithful emulation of production workload distributions.
- Diversity Metrics: Current diversity metrics are based on coarse topology classes. Integration of finer-grained join graph metrics could provide deeper coverage analysis.
Implications for Learned Database Systems
SynQL's design and empirical results provide evidence that schema-aware, deterministic workload synthesis is necessary for robust learned QET/cost estimation, especially when privacy barriers preclude the use of real query logs. Explicit topological and predicate controls make SynQL suitable for stress-testing optimizer generalization under rare or structurally diverse query classes, a longstanding challenge inadequately addressed by fixed benchmarks or LLM-based synthesis. Its modular design facilitates extensibility towards richer SQL fragments and other relational engines, and positions it as a platform on which to pre-train robust learned optimizers, even in cross-engine or cross-enterprise settings.
Conclusion
SynQL demonstrates that rule-based, topology-controllable query synthesis resolves fundamental limitations of both fixed-template benchmarks and generative LLM-based SQL generators for learned query optimization research. The framework’s systematic approach yields high topological entropy, excellent schema validity, and supports robust model generalization—directly addressing the training bottleneck in cost-model development for database systems. Future work includes expanding SQL fragment coverage, validating production transfer, enabling plug-and-play engine portability, and refining workload diversity metrics to cover more granular structural dimensions.
Reference:
"SynQL: A Controllable and Scalable Rule-Based Framework for SQL Workload Synthesis for Performance Benchmarking" (2604.08021)