Join DAG Construction for Query Optimization

Updated 18 November 2025

Join DAG construction is a method that builds a directed acyclic graph representing all intermediate join results for systematic query optimization.
It employs both full enumeration and incremental updating techniques to efficiently manage complex multi-join queries.
Cost models based on cardinality and selectivity are integral, enabling early selection placement that refines join plans and minimizes intermediate result sizes.

A join Directed Acyclic Graph (join DAG) is a foundational data structure in the representation and optimization of relational database queries involving join operations. Each node in a join DAG corresponds to an intermediate join result over a subset of base relations, and directed edges represent individual join operations. The construction, maintenance, and utilization of join DAGs directly determine the efficiency and scalability of query optimization, particularly when searching for minimal-cost execution strategies in complex queries with multiple joins and selection predicates. Join DAGs support both pre-computed and incremental formation, enabling rapid generation and refinement of optimal join plans as query workloads evolve or as selections are "sprinkled" throughout the query plan [0202035].

1. Formal Definitions and Core Structures

Given a relational schema $R = \{R_1, R_2, \dots, R_n\}$ , the join graph $G = (V, E)$ is an undirected graph with $V$ as the relations and $E = \{(R_i, R_j, \theta_{ij})\}$ where join predicates $\theta_{ij}$ exist. The join DAG expands this, with directed nodes representing all intermediate joins built over subsets $S(u) \subseteq R$ .

Leaf nodes: Singleton sets $\{R_i\}$ corresponding to scans of base relations.
Internal nodes: Each with a scope $S(u) = X \cup Y$ where $X \cap Y = \emptyset$ and some join predicate connects $X$ and $Y$ .
Root node: Scope is $R$ , representing the complete query result.

For every subset $S \subseteq R$ , at most one node exists for each unique join result, maximizing subplan sharing and reducing combinatorial blowup typical of naive join enumeration [0202035].

2. Join DAG Construction Algorithms

2.1 Pre-computation (Full Enumeration)

The standard approach constructs the join DAG bottom-up, iterating over all subsets of $R$ in increasing cardinality. For each subset $S$ , all partitions into disjoint $X, Y$ are examined, and nodes plus edges are added if a join edge exists between the two subsets. Afterward, a dynamic programming pass annotates each node with its minimal construction cost:

function BUILD_JOIN_DAG(V, E):
  nodeMap ← new hash-map<bitmask, Node>()
  for each i in [1..n]:
    mask ← 1 << (i-1)
    nodeMap[mask] ← Node(scope= {R_i}, cost=0)
  for size from 2 to n:
    for each bitmask S of size 'size' over n bits:
      for each nonempty proper submask X of S:
        Y = S \ X
        if EXISTS_JOIN_EDGE(X, Y, E):
          left  ← nodeMap[X]
          right ← nodeMap[Y]
          newNodeMask ← S
          if nodeMap does not contain newNodeMask:
            nodeMap[newNodeMask] ← Node(scope=relations(S))
          u ← nodeMap[newNodeMask]
          u.addChild(left)
          u.addChild(right)
  return nodeMap[(1<<n) - 1]

The resulting join DAG enables exponential sharing: all distinct orderings of joins among the same relations reuse the same intermediate nodes [0202035].

2.2 Incremental (History) Join DAG Formation

For dynamic, workload-driven optimization, the history join DAG supports incremental updates. When a new query with $R_\text{new} \subseteq R_\text{hist}$ arrives, only the minimal necessary subgraph is constructed, leveraging existing nodes where possible. The incremental algorithm scans for new subsets missing from the current DAG, creates nodes for those, and recomputes costs in affected subgraphs:

function INCREMENTAL_UPDATE(D_hist, R_new, E_new):
  maskNew ← bitmask of R_new
  existingMasks ← set of masks in D_hist
  neededMasks ← all nonempty submasks of maskNew
  for each mask S in neededMasks in increasing popcount:
    if S ∉ existingMasks:
      for each nonempty proper submask X of S:
        Y = S \ X
        if (X ∈ existingMasks or X newly added) and
           (Y ∈ existingMasks or Y newly added) and
           EXISTS_JOIN_EDGE(X, Y, E_hist ∪ E_new):
             create node for S if not exists
             link children X, Y
             existingMasks ∪= {S}
             break
  for each newly created node u in topological order:
    compute cost(u) as before
  return updated D_hist

This supports efficient adaptation to schema evolution or repeated/related queries [0202035].

3. Cost Models and Complexity

Each join DAG node is annotated with cost estimates (often cardinality-based) and outputs, propagated bottom-up:

For a join producing node $u$ $u$ from $v, w$ $v, w$ :
- $|u| = |v| \times |w| \times \text{sel}(v, w)$
- $\text{cost}(u) = \text{cost}(v) + \text{cost}(w) + \alpha \cdot |v| \cdot |w| \cdot \text{sel}(v,w)$ ,
- where $\alpha$ absorbs per-tuple resource costs.

Complexity is dictated by the number of subsets (nodes) and possible partitions per subset:

Nodes: $2^n$
Edges: up to $O(3^n)$
Total time: $O(3^n)$ due to enumerating all binary partitions for all subsets.

Incremental construction and caching can amortize this cost in query workloads with recurring access patterns or modest query width [0202035].

Selection predicates are "sprinkled"—assigned as early as possible to minimize intermediate result sizes:

If $\sigma$ references just $R_i$ : annotate leaf node $\{R_i\}$ .
Otherwise, place $\sigma$ at the join DAG node whose scope covers all referenced tables, lowest in the DAG.

The cost model is then adjusted:

For selection at $u$ $u$ with selectivity $\text{sel}(\sigma)$ $sel (σ)$ :
- Update $|u| \leftarrow |u| \times \text{sel}(\sigma)$
- Update all upstream costs to reflect the reduced output size.

This refinement can change the optimal plan, as introducing a highly selective predicate early can change cardinality and join order preferences [0202035].

5. Applications and Experimental Results

Join DAGs provide a compact representation for searching optimal join orders and efficiently supporting multiple-query optimization, including selection and materialized view selection in data warehouse environments. Empirical analysis on TPC-D/H query sets demonstrates that join DAG-based optimization outperforms traditional AND/OR DAG techniques in both preprocessing cost and plan quality [0202035].

The approach directly benefits systems requiring:

Exhaustive plan enumeration without redundant computation,
Adaptive reuse in evolving query workloads,
Optimized plan selection in the presence of complex selection and join conditions.

6. Worked Example and Practical Workflow

Consider a schema $R = \{A, B, C\}$ with joins $A.x = B.x$ and $B.y = C.y$ :

DAG construction identifies nodes for every nonempty subset (e.g., $\{A\}, \{B\}, \{C\}, \{A,B\}, \{B,C\}, \{A,B,C\}$ ), with plan edges reflecting joinability.
Cost calculation selects, for $\{A,B,C\}$ , the lower-cost of possible join sequences (e.g., $(a \bowtie (bc))$ vs. $(ab) \bowtie c$ ), given cardinalities and selectivities.
Incremental update supports dynamic addition of join predicates (e.g., introducing $D$ ), by reusing existing nodes and only building new joins involving novel relations.
Selection refinement: Pushing $\sigma: B.y > 100$ to $\{B\}$ , recalculates affected costs and may alter the optimal join order.

These steps collectively enable implementers to instantiate a complete join optimization pipeline leveraging the join DAG paradigm, as illustrated in [0202035].

PDF Markdown Chat (Pro)

Follow Topic

Get notified by email when new papers are published related to Join DAG Construction.

Join DAG Construction for Query Optimization

1. Formal Definitions and Core Structures

2. Join DAG Construction Algorithms

2.1 Pre-computation (Full Enumeration)

2.2 Incremental (History) Join DAG Formation

3. Cost Models and Complexity

4. Selection Placement and Plan Refinement

5. Applications and Experimental Results

6. Worked Example and Practical Workflow

Follow Topic

Continue Learning

Join DAG Construction for Query Optimization

1. Formal Definitions and Core Structures

2. Join DAG Construction Algorithms

2.1 Pre-computation (Full Enumeration)

2.2 Incremental (History) Join DAG Formation

3. Cost Models and Complexity

4. Selection Placement and Plan Refinement

5. Applications and Experimental Results

6. Worked Example and Practical Workflow

Follow Topic

Continue Learning

Related Topics