Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
134 tokens/sec
GPT-4o
10 tokens/sec
Gemini 2.5 Pro Pro
47 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

GRACEFUL: UDF Cost Estimator

Updated 1 July 2025
  • Cost estimators are systems that predict resource usage—like runtime and financial cost—for executing processes and database queries.
  • GRACEFUL integrates UDF logic into a joint graph model using GNNs and branch frequency estimation to yield accurate, transferable cost predictions.
  • The approach enables cost-based query optimization, informing UDF filter placement to achieve performance speedups of up to 50x over default heuristics.

A cost estimator is a computational or algorithmic system designed to predict the resource expenditure—commonly time or financial cost—associated with executing a process, project, or computational plan. In database management systems (DBMS), cost estimators play a crucial role in query optimization by predicting the runtime or computational cost of executing plans, including those involving User-Defined Functions (UDFs). In the context of modern DBMSs where UDFs are pervasive and optimization is complicated by limited cost model support for user code, GRACEFUL introduces a new learned cost estimation approach specifically targeting UDF-aware query processing (2503.23863).

1. GRACEFUL: Purpose and Innovations

GRACEFUL (GRAph-based Cost Estimator For User-defined Logic) addresses the core challenge that traditional query optimizers either ignore UDFs in cost modeling or assign them overly simplistic and static costs. This omission leads to frequent suboptimal query plans, particularly when UDFs dominate execution costs or have unpredictable behavior across varied input data or code paths.

The main innovations of GRACEFUL are:

  • Joint structured modeling of query plans and UDFs using a heterogeneous, annotated graph-based representation, enabling integration of operator, data, and code logic.
  • Transfer learning and generalization: The system provides accurate (“zero-shot”) runtime predictions for unseen queries, UDFs, and data, leveraging its transferable representation.
  • Empowering cost-based optimization: The model supplies quantitative cost predictions enabling cost-based decisions for UDF filter placement, with documented end-to-end speedups up to 50x by informing when to pull up or push down UDF predicates.

2. Technical Architecture and Methodology

GRACEFUL encodes both UDFs and query plans as richly annotated directed acyclic graphs (DAGs), which are then consumed by a machine learning model based on Graph Neural Networks (GNNs) combined with a Multi-Layer Perceptron (MLP) regression head. The technical flow comprises:

  • UDF Graph Extraction: Python UDFs are parsed into control-flow graphs (CFGs). Nodes represent code constructs (computation, branch, loop, INV for invocation, RET for return), with edges encoding control paths. CFGs are made DAGs for GNN processing.
  • Graph Annotation: Each node is annotated with features including:
    • Code features: number of loops, arithmetic/string ops, library calls, data types.
    • Data flow: Row cardinalities entering/exiting nodes, as determined by query plan context and database statistics.
  • Branch Hit-Ratio Estimation: The frequency of each branch’s execution in the UDF is estimated by converting control flow predicates into equivalent SQL WHERE clauses and invoking the DBMS's cardinality estimator, thus quantifying how often UDF branches are taken under specific query predicates and joins.
  • Joint Query-UDF Graph Construction: The annotated UDF graph is embedded into the query plan graph, connecting via column access and data flow edges at the points of UDF invocation/return.
  • Learning/Inference Pipeline:
    • The joint graph is passed through a GNN to yield a vector representation.
    • The MLP head produces a runtime prediction y^\hat{y} for the whole query.
    • Training uses regression against measured runtimes from benchmark executions.

Q-error is used for evaluation: Q=max(y^y,yy^)Q = \max\left(\frac{\hat{y}}{y}, \frac{y}{\hat{y}}\right) where yy is actual observed runtime, y^\hat{y} the prediction.

GRACEFUL also models cost uncertainty after filters with unknown selectivity by predicting cost distributions under a range of hypothetical selectivities, supporting robust optimization.

3. Synthetic Dataset and Benchmark

To facilitate research and evaluation, GRACEFUL provides a large public benchmark:

  • Scope: 90,000+ queries across 20 datasets (18 real + TPC-H, SSB, IMDB), each involving scalar Python UDFs in filters or projections, with up to 5 joins and 21 filters per query.
  • UDFs: Programmatically generated to match realistic cloud DB UDF characteristics (multiple branches, loops, function/library calls, etc.), and checked for semantic/data correctness (e.g., no runtime errors).
  • Data: True runtimes for all queries are measured on DuckDB, providing a ground truth reference for training and evaluation.

This dataset is the first large-scale public resource for learned UDF cost modeling, covering a breadth of code structures, query types, data distributions, and execution contexts.

4. Evaluation and Empirical Results

Experiments confirm that GRACEFUL achieves high predictive accuracy and practical benefit:

  • Accuracy: Median Q-error is typically below 1.3 with accurate cardinality estimates (actual or DeepDB-based), and the approach generalizes across unseen UDFs, queries, and datasets via leave-one-database-out validation.
  • Superiority to baselines: Outperforms both flat vector and graph-based baselines that lack integration of UDF logic, branch-frequency modeling, or explicit code-data flow representation.
  • End-to-end optimization impact: By leveraging cost estimates, optimizers can decide to pull up expensive UDF filters (apply after joins when beneficial), realizing speedups up to 50x compared to the standard push-down heuristic (as in cases where filtering early applies a costly UDF to unnecessary tuples).
  • Robust Advising: Under selectivity uncertainty, GRACEFUL enables strategies (conservative, expected-area, or upper-bound) to pick pull-up/push-down plans minimizing expected cost or risk of regressions.

5. Impact on DBMS Optimization

GRACEFUL fundamentally enables cost-based decision-making for UDF predicate placement in physical query plans. This allows query optimizers to:

  • Replace heuristics (e.g., blindly push down filters) with quantitative cost predictions tailored to code and data context.
  • Model the cost effect of UDF branch complexity, selectivity, data size, and code structure in context, instrumenting trade-offs between CPU, IO, and network cost based on the actual workload.
  • Under selectivity uncertainty (post-UDF filter placement), optimize for worst-case, mean, or conservative-case outcomes by enumerating plans and hypothetical selectivities, leveraging the cost estimator’s distributional output.

In practice, the approach yields significant speedups (median >1.3x, maxima of 50x) with minimal risk of performance regression when using conservative strategies.

6. Limitations and Applications

GRACEFUL’s generalization capabilities are demonstrated, but the approach relies on accurate cardinality estimates; erroneous data statistics can limit prediction accuracy, as with all learned cost models. A plausible implication is that integrating advanced cardinality estimators can further strengthen performance. The method is validated on scalar Python UDFs and may need extensions for complex or procedural UDFs. The system and dataset are public, fostering ongoing research.

Summary Table: GRACEFUL

Component Description Metric/Example
System Goal Predict runtime of queries with UDFs, enabling cost-based plans Q-error < 1.3 (median)
Representation Annotated joint query + UDF graph (CFG, DAG, with plan, code, data edges) See Figure 1 in paper
Branch statistics Predicted by converting UDF branch predicates to SQL and querying DBMS cardinality estimators Hit-ratio (branch freq.)
ML model Graph Neural Network + MLP regression GNN(Gquery+udf\mathcal{G}_{query+udf})
Optimization impact Informs UDF pull-up/push-down; up to 50x real speedup over defaults End-to-end studies
Dataset (benchmark) 90,000+ labeled queries, 20 DBs, diverse UDFs, runtime ground-truth DuckDB, open resource

GRACEFUL represents a substantial advance in addressing UDF-aware cost modeling for query optimization. By fusing explicit code structure, data flow, query plan, and runtime statistics in an extensible ML framework, it establishes a new standard for practical and accurate cost estimation in modern DBMS environments.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)