RuleScript: Engine-Agnostic Query Rewrite DSL
- RuleScript is a domain-specific language that enables engine-agnostic definition, verification, and deployment of query rewrite rules in relational databases.
- It separates rule definition from execution via explicit match and transform phases and employs formal verification using QED for rule correctness.
- By facilitating reusable, formally verified query rewrites, RuleScript enhances optimizer reliability, portability, and performance across diverse database engines.
RuleScript is an engine-agnostic domain-specific language (DSL) designed for writing, verifying, and deploying logical query rewrite rules in relational database systems. It separates rule definition from execution infrastructure, employs a core language based on relational algebra with bag semantics, and incorporates an explicit distinction between matching and transformation phases. RuleScript achieves formal verification of rewrite correctness for all schemas and expressions, supports custom operator extensibility, and enables a "write once, deploy everywhere" paradigm for query optimization across diverse database engines (Pan et al., 7 May 2026).
1. Motivation and Problem Definition
Logical query plan rewriting is a cornerstone of relational database query optimization, transforming queries into equivalent yet more efficient forms. In conventional systems such as Apache Calcite (Java-based, tightly coupled to its RelNode API) and CockroachDB’s Optgen (bespoke DSL per engine), rewrite rules are handwritten, engine-specific, and interwoven with the execution engine’s internals. These implementations commonly lack formal correctness guarantees, resulting in semantics-preservation being assumed rather than proven—a deficiency that has led to optimizer-induced errors and incorrect query results in MySQL, SQLite, PostgreSQL, and other systems.
Furthermore, this architecture impedes portability: refactoring or developing a new backend requires rule reimplementation, raising engineering cost and the risk of divergence or subtle bugs across backends. RuleScript’s principal design goals are thus: engine-agnostic specification, universal rule correctness via formal verification, extensibility to new algebraic operators (e.g., SemiJoin, Window), and reduction of per-rule boilerplate through a declarative paradigm.
2. Core DSL and Pattern Matching Architecture
RuleScript’s core language is inspired by bag-semantics relational algebra, supporting six primary operators:
- Selection (Filter):
- Projection (Project):
- Inner Join:
- Union:
- Distinct:
- Aggregation:
Each operator is formally interpreted over bags of tuples, leveraging uninterpreted symbols for functions, predicates, aggregate functions, and plan fragments.
The pattern-matching grammar is defined in BNF as follows:
9
RuleScript separates the rule lifecycle into two explicit phases:
- Match phase: For a rule , given a concrete plan , the system instantiates all symbol variables (, , 0, 1) such that 2. If this holds, the rule fires.
- Transform phase: Recorded symbol instantiations 3 are used to construct the rewritten plan 4.
These phases are governed by structural inference rules. For example, the “Filter” operator matching is handled via recursive pattern decomposition and binding extraction.
3. Formal Verification and Equivalence Checking
A defining feature of RuleScript is universal formal verification of rule correctness. For any rule 5, the system generates an equivalence obligation:
6
This obligation is submitted to QED (Query Equivalence Decider), which encodes plans as semi-ring polynomials over uninterpreted functions and bags. QED performs normalization and SMT-based reasoning to discharge or refute equivalence, ensuring that no counterexample exists for any schema, function, or plan instance. When custom operators are involved, rules are expanded by semantic definitions before checking.
Verification proceeds as follows:
- Expand custom operators to core-only definitions.
- Encode patterns into QED’s formalism.
- Invoke QED for equivalence. Acceptance occurs if QED proves equivalence; otherwise, rules are rejected or flagged (e.g., due to timeouts or counterexamples).
Empirically, all 33 reimplemented Calcite rules were verified within 5 seconds each.
4. Extensibility and Custom Operator Semantics
RuleScript supports extensibility through first-class custom operator definitions. A custom operator definition has the syntax:
0
For example, the left SemiJoin operator is defined by expansion into primitive Filter and existential quantification:
1
This extensibility allows users to capture backend-specific semantics within the engine-agnostic language, while ensuring that formal correctness and universal equivalence obligations are maintained by always expanding non-core constructs before equivalence checking.
5. Integration: Adapter Architecture and Deployment
Backend integration is achieved via lightweight adapters. Each backend provides:
- Match handlers: These detect whether a plan node aligns with a core/custom operator pattern and extract child nodes and expressions.
- Transform handlers: These construct new plan nodes in the backend’s native API from recorded bindings.
A RuleAdapter consists of a catalog of rule pairs (“from” and “to” patterns) and a registry of operator handlers. During query optimization, a generic engine traverses plan trees, invokes handlers, and applies rewrites.
Adapters in practice require a one-time implementation effort (Calcite: 518 LOC; CockroachDB Optgen: 836 LOC; DataFusion interpreter: 1657 LOC). After adapter creation, reusing and deploying new rewrite rules is backend-independent and code-free per individual rule.
6. Empirical Evaluation and Performance
The RuleScript system was evaluated across several dimensions:
- Rule Coverage: 33 of 91 Calcite core rules (spanning Aggregation, Filter, Project, Join, SetOps) were ported, covering major transformation patterns: Transpose (6/21), Merge (7/11), Pushdown (4/6), JoinTrans (3/8), Simplify (11/32), Expansion (2/7).
- Portability: Rules were deployed unmodified to CockroachDB and Apache Data Fusion, backends with substantially different execution engines.
- Implementation Effort: RuleScript required 1051 total LOC for 33 rules (median 21 LOC/rule), versus Calcite native (3877 LOC total, median 94 LOC/rule) and Optgen native (594 LOC total, median 18 LOC/rule). This demonstrates a significant reduction in per-rule implementation when using RuleScript, while adding verification and portability.
- Optimization Impact: On TPC-H (22 queries), CockroachDB with RuleScript rules achieved a geometric mean 1.5× speedup over baseline; DataFusion saw commensurate benefits. Rule‐matching overhead was negligible compared to the cost-based search and plan enumeration.
7. Exemplary Rules and Execution Flow
Example 1: SemiJoin–Aggregate Transpose
Pattern:
- From:
SemiJoin(λ (k,v), y. P(k, y), Aggregate(λ x. G(x), λ x. A(x), L), R) - To:
Aggregate(λ x. G(x), λ x. A(x), SemiJoin(λ x, y. P(G(x), y), L, R))
Verification: Expands via the custom SemiJoin definition, reducing the proof to equality over existential predicates and grouping, under the constraint that predicate 7 is independent of aggregation values.
Execution: The system matches rule patterns, instantiates bindings for L, R, G, A, and P on the concrete plan, applies the transformation, and produces the new plan through the target backend’s handler.
Example 2: PruneEmptyFilter
Pattern:
- From:
Filter(λ x. P(x), Empty(σ)) - To:
Empty(σ)
Proof: For any predicate 8, applying a filter to an empty relation yields an empty relation. Verified as a trivial case by the QED solver.
Deployment Declaration (sample):
2 Runtime handling is immediate: any filter matched on an empty input is dropped.
8. Conclusion and Significance
RuleScript offers a principled, declarative, and formally verified framework for specifying query rewrite rules, with direct support for engine-agnostic deployment, rule-level correctness, extensibility via semantic operator definitions, and practical integration through lightweight adapters. The analytic coverage, performance advantages, and engineering efficiency realized in large-scale real-world engines demonstrate its capacity to address long-standing challenges in query optimizer portability and reliability (Pan et al., 7 May 2026).