Join DAG Construction for Query Optimization
- Join DAG construction is a method that builds a directed acyclic graph representing all intermediate join results for systematic query optimization.
- It employs both full enumeration and incremental updating techniques to efficiently manage complex multi-join queries.
- Cost models based on cardinality and selectivity are integral, enabling early selection placement that refines join plans and minimizes intermediate result sizes.
A join Directed Acyclic Graph (join DAG) is a foundational data structure in the representation and optimization of relational database queries involving join operations. Each node in a join DAG corresponds to an intermediate join result over a subset of base relations, and directed edges represent individual join operations. The construction, maintenance, and utilization of join DAGs directly determine the efficiency and scalability of query optimization, particularly when searching for minimal-cost execution strategies in complex queries with multiple joins and selection predicates. Join DAGs support both pre-computed and incremental formation, enabling rapid generation and refinement of optimal join plans as query workloads evolve or as selections are "sprinkled" throughout the query plan [0202035].
1. Formal Definitions and Core Structures
Given a relational schema , the join graph is an undirected graph with as the relations and where join predicates exist. The join DAG expands this, with directed nodes representing all intermediate joins built over subsets .
- Leaf nodes: Singleton sets corresponding to scans of base relations.
- Internal nodes: Each with a scope where and some join predicate connects and .
- Root node: Scope is , representing the complete query result.
For every subset , at most one node exists for each unique join result, maximizing subplan sharing and reducing combinatorial blowup typical of naive join enumeration [0202035].
2. Join DAG Construction Algorithms
2.1 Pre-computation (Full Enumeration)
The standard approach constructs the join DAG bottom-up, iterating over all subsets of in increasing cardinality. For each subset , all partitions into disjoint are examined, and nodes plus edges are added if a join edge exists between the two subsets. Afterward, a dynamic programming pass annotates each node with its minimal construction cost:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
function BUILD_JOIN_DAG(V, E):
nodeMap ← new hash-map<bitmask, Node>()
for each i in [1..n]:
mask ← 1 << (i-1)
nodeMap[mask] ← Node(scope= {R_i}, cost=0)
for size from 2 to n:
for each bitmask S of size 'size' over n bits:
for each nonempty proper submask X of S:
Y = S \ X
if EXISTS_JOIN_EDGE(X, Y, E):
left ← nodeMap[X]
right ← nodeMap[Y]
newNodeMask ← S
if nodeMap does not contain newNodeMask:
nodeMap[newNodeMask] ← Node(scope=relations(S))
u ← nodeMap[newNodeMask]
u.addChild(left)
u.addChild(right)
return nodeMap[(1<<n) - 1] |
The resulting join DAG enables exponential sharing: all distinct orderings of joins among the same relations reuse the same intermediate nodes [0202035].
2.2 Incremental (History) Join DAG Formation
For dynamic, workload-driven optimization, the history join DAG supports incremental updates. When a new query with arrives, only the minimal necessary subgraph is constructed, leveraging existing nodes where possible. The incremental algorithm scans for new subsets missing from the current DAG, creates nodes for those, and recomputes costs in affected subgraphs:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
function INCREMENTAL_UPDATE(D_hist, R_new, E_new):
maskNew ← bitmask of R_new
existingMasks ← set of masks in D_hist
neededMasks ← all nonempty submasks of maskNew
for each mask S in neededMasks in increasing popcount:
if S ∉ existingMasks:
for each nonempty proper submask X of S:
Y = S \ X
if (X ∈ existingMasks or X newly added) and
(Y ∈ existingMasks or Y newly added) and
EXISTS_JOIN_EDGE(X, Y, E_hist ∪ E_new):
create node for S if not exists
link children X, Y
existingMasks ∪= {S}
break
for each newly created node u in topological order:
compute cost(u) as before
return updated D_hist |
3. Cost Models and Complexity
Each join DAG node is annotated with cost estimates (often cardinality-based) and outputs, propagated bottom-up:
- For a join producing node from :
- ,
- where absorbs per-tuple resource costs.
Complexity is dictated by the number of subsets (nodes) and possible partitions per subset:
- Nodes:
- Edges: up to
- Total time: due to enumerating all binary partitions for all subsets.
Incremental construction and caching can amortize this cost in query workloads with recurring access patterns or modest query width [0202035].
4. Selection Placement and Plan Refinement
Selection predicates are "sprinkled"—assigned as early as possible to minimize intermediate result sizes:
- If references just : annotate leaf node .
- Otherwise, place at the join DAG node whose scope covers all referenced tables, lowest in the DAG.
The cost model is then adjusted:
- For selection at with selectivity :
- Update
- Update all upstream costs to reflect the reduced output size.
This refinement can change the optimal plan, as introducing a highly selective predicate early can change cardinality and join order preferences [0202035].
5. Applications and Experimental Results
Join DAGs provide a compact representation for searching optimal join orders and efficiently supporting multiple-query optimization, including selection and materialized view selection in data warehouse environments. Empirical analysis on TPC-D/H query sets demonstrates that join DAG-based optimization outperforms traditional AND/OR DAG techniques in both preprocessing cost and plan quality [0202035].
The approach directly benefits systems requiring:
- Exhaustive plan enumeration without redundant computation,
- Adaptive reuse in evolving query workloads,
- Optimized plan selection in the presence of complex selection and join conditions.
6. Worked Example and Practical Workflow
Consider a schema with joins and :
- DAG construction identifies nodes for every nonempty subset (e.g., ), with plan edges reflecting joinability.
- Cost calculation selects, for , the lower-cost of possible join sequences (e.g., vs. ), given cardinalities and selectivities.
- Incremental update supports dynamic addition of join predicates (e.g., introducing ), by reusing existing nodes and only building new joins involving novel relations.
- Selection refinement: Pushing to , recalculates affected costs and may alter the optimal join order.
These steps collectively enable implementers to instantiate a complete join optimization pipeline leveraging the join DAG paradigm, as illustrated in [0202035].
Sponsored by Paperpile, the PDF & BibTeX manager trusted by top AI labs.
Get 30 days free