Multi-Set Convolutional Network (MSCN)

Updated 12 May 2026
  • MSCN is a deep learning model for cardinality estimation that represents queries as unordered sets of tables, joins, and predicates.
  • It employs permutation-invariant deep set operations with average pooling and MLPs to efficiently capture correlations and handle sparse, zero-tuple scenarios.
  • Empirical results show MSCN outperforms traditional sampling and histogram methods, offering a compact, scalable solution for complex relational queries.

The Multi-Set Convolutional Network (MSCN) is a deep learning architecture designed for cardinality estimation in relational databases, as introduced in "Learned Cardinalities: Estimating Correlated Joins with Deep Learning" (Kipf et al., 2018). MSCN represents relational queries as sets of tables, join predicates, and base-table predicates, capturing set semantics via neural permutation invariance. It improves on traditional sampling and histogram-based estimators by addressing challenges such as sparse selectivity (zero-tuple situations), capturing join-crossing correlations, and scalability in footprint and computation.

1. Query Representation as Unordered Sets

MSCN encodes each conjunctive select-join query $q \in Q$ as a triple of sets $(T_q, J_q, P_q)$:

  • $T_q \subseteq T$: the set of base tables involved,
  • $J_q \subseteq J$: the set of join predicates (primarily foreign-key = primary-key),
  • $P_q \subseteq P$: the set of selection predicates, formatted as (column, operator, value).

Each element is featurized as follows:

  • Table $t \in T$: one-hot vector $v_t \in \{0,1\}^{|T|}$, optionally concatenated with either a scalar $s_t$ (qualifying sample count) or a bitmap $b_t \in \{0,1\}^S$ (bitmask for $S$ materialized samples).
  • Join $j \in J$: one-hot vector $v_j \in \{0,1\}^{|J|}$.
  • Predicate $p = (\mathrm{col}, \mathrm{op}, \mathrm{val})$: feature vector $v_p$ built from one-hot vectors for $\mathrm{col}$ and $\mathrm{op}$ ($\{0,1\}^3$ for $<$, $=$, $>$), plus a normalized $\mathrm{val} \in [0,1]$, derived by linear scaling between the column's minimum and maximum.

A featurized query therefore consists of three sets of vectors:

  • Tables: every element of $T_q$, each as a one-hot vector.
  • Joins: every element of $J_q$, each as a one-hot vector.
  • Predicates: every element of $P_q$, each as a one-hot column, a one-hot operator, and a normalized value.
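The featurization above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the toy schema, the vocabulary lists, the `COLUMN_RANGE` statistics, and all function names are assumptions made up for this example.

```python
# Hypothetical toy schema; names and min/max statistics are illustrative only.
TABLES = ["title", "movie_companies", "cast_info"]
JOINS = ["title.id=movie_companies.movie_id", "title.id=cast_info.movie_id"]
COLUMNS = ["title.production_year", "movie_companies.company_id"]
OPERATORS = ["<", "=", ">"]
COLUMN_RANGE = {
    "title.production_year": (1900, 2020),
    "movie_companies.company_id": (1, 1001),
}


def one_hot(item, vocabulary):
    """Return a 0/1 list with a single 1 at the item's vocabulary index."""
    return [1.0 if entry == item else 0.0 for entry in vocabulary]


def featurize_predicate(column, operator, value):
    """Concatenate one-hot column, one-hot operator, and min-max scaled value."""
    low, high = COLUMN_RANGE[column]
    scaled = (value - low) / (high - low)
    return one_hot(column, COLUMNS) + one_hot(operator, OPERATORS) + [scaled]


def featurize_query(tables, joins, predicates):
    """Map a query to three collections of feature vectors (T_q, J_q, P_q)."""
    table_vecs = [one_hot(t, TABLES) for t in tables]
    join_vecs = [one_hot(j, JOINS) for j in joins]
    pred_vecs = [featurize_predicate(*p) for p in predicates]
    return table_vecs, join_vecs, pred_vecs


table_vecs, join_vecs, pred_vecs = featurize_query(
    tables=["title", "movie_companies"],
    joins=["title.id=movie_companies.movie_id"],
    predicates=[
        ("title.production_year", ">", 2010),
        ("movie_companies.company_id", "=", 251),
    ],
)
```

Lists stand in for sets here; the order of elements does not matter to the model because the downstream pooling is permutation-invariant.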

2. Multi-Set Convolutional Operator

MSCN builds on the Deep Sets theorem, under which a permutation-invariant function $f$ on a set $X$ can be written as $f(X) = \rho\left(\sum_{x \in X} \phi(x)\right)$. For each query:

  • $w_T = \frac{1}{|T_q|} \sum_{t \in T_q} \mathrm{MLP}_T(v_t)$
  • $w_J = \frac{1}{|J_q|} \sum_{j \in J_q} \mathrm{MLP}_J(v_j)$
  • $w_P = \frac{1}{|P_q|} \sum_{p \in P_q} \mathrm{MLP}_P(v_p)$

Here, $v_t$, $v_j$, and $v_p$ are the feature vectors of individual tables, joins, and predicates, and each $\mathrm{MLP}$ denotes a two-layer fully connected network with ReLU activation. Average pooling keeps the embedding magnitude stable regardless of set size. When sampling features are used, $v_t$ includes the one-hot table ID concatenated with $s_t$ or $b_t$.
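The averaging step can be illustrated with a small, untrained network in plain Python. This is a sketch under stated assumptions: the weights are random, the dimensions are tiny, and the helper names (`make_mlp`, `set_embedding`) are hypothetical. It only demonstrates that averaging per-element MLP outputs yields the same embedding for any ordering of the set.

```python
import random


def make_mlp(in_dim, hidden_dim, rng):
    """Create random weights for a two-layer ReLU network (untrained)."""
    def layer(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {
        "w1": layer(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
        "w2": layer(hidden_dim, hidden_dim), "b2": [0.0] * hidden_dim,
    }


def dense_relu(weights, bias, x):
    """Compute ReLU(Wx + b) for one input vector."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def mlp_forward(params, x):
    """Apply both layers of the network to one set element."""
    hidden = dense_relu(params["w1"], params["b1"], x)
    return dense_relu(params["w2"], params["b2"], hidden)


def set_embedding(params, vectors):
    """Average the per-element MLP outputs over a non-empty set."""
    outputs = [mlp_forward(params, v) for v in vectors]
    return [sum(column) / len(outputs) for column in zip(*outputs)]


rng = random.Random(0)
params = make_mlp(in_dim=3, hidden_dim=4, rng=rng)
table_set = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

w_forward = set_embedding(params, table_set)
w_reversed = set_embedding(params, list(reversed(table_set)))

# The two embeddings agree up to floating-point rounding.
same = all(abs(a - b) < 1e-12 for a, b in zip(w_forward, w_reversed))
```

The sketch assumes every set is non-empty; how empty join or predicate sets are handled in practice is not specified here.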

3. Architecture and Cardinality Regression

The pooled embeddings $w_T$, $w_J$, and $w_P$ (each in $\mathbb{R}^d$ with $d = 256$) are concatenated:

$w = [w_T, w_J, w_P]$

This vector $w$ is processed by a final output MLP ($\mathrm{MLP}_{\mathrm{out}}$) with a sigmoid activation in the last layer. The output $w_{\mathrm{out}} \in [0,1]$ is interpreted as a normalized log-cardinality estimate. Normalization is performed by scaling the ground-truth log-cardinality into the $[0,1]$ range over the training dataset:

$y = \frac{\log c - \log c_{\min}}{\log c_{\max} - \log c_{\min}}$

Here, $c$ is the true cardinality of a query, and $c_{\min}$ and $c_{\max}$ are the smallest and largest cardinalities in the training set.

At inference, normalization is inverted to recover the actual predicted cardinality.
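The label normalization and its inverse can be sketched as follows. The function names and the toy cardinalities are assumptions for illustration; the only logic shown is min-max scaling of log-cardinalities and the matching inverse transform.

```python
import math


def fit_log_range(cardinalities):
    """Return the min and max log-cardinality seen in the training labels."""
    logs = [math.log(c) for c in cardinalities]
    return min(logs), max(logs)


def normalize(cardinality, log_min, log_max):
    """Scale log(cardinality) into [0, 1] using the training range."""
    return (math.log(cardinality) - log_min) / (log_max - log_min)


def denormalize(y, log_min, log_max):
    """Invert the scaling to turn a network output back into a cardinality."""
    return math.exp(y * (log_max - log_min) + log_min)


# Toy training labels (illustrative values only).
train_cards = [1, 10, 100, 1000, 10000]
log_min, log_max = fit_log_range(train_cards)

y = normalize(100, log_min, log_max)
c_hat = denormalize(y, log_min, log_max)
```

Because the sigmoid output is confined to $[0,1]$, a model normalized this way cannot predict a cardinality outside the range observed during training.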

4. Training Objective and Optimization

The network is trained to minimize the mean q-error over a training query set $Q_{\mathrm{train}}$:

$\mathcal{L} = \frac{1}{|Q_{\mathrm{train}}|} \sum_{q \in Q_{\mathrm{train}}} \text{q-error}(q)$

$\text{q-error}(q) = \max\left(\frac{\hat{c}_q}{c_q}, \frac{c_q}{\hat{c}_q}\right)$

where $\hat{c}_q$ is the de-normalized estimate and $c_q$ is the true cardinality. The mean q-error directly reflects the multiplicative error metric relevant for optimizers. Comparative experiments with MSE and geometric mean q-error showed that the mean q-error delivered superior empirical results. Training uses the Adam optimizer (learning rate 0.001).
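As a small worked example, the q-error and its mean can be computed as below. The function names and numbers are illustrative assumptions, and the sketch assumes strictly positive estimates and true cardinalities.

```python
def q_error(estimate, truth):
    """Multiplicative error: the factor by which the estimate is off."""
    return max(estimate / truth, truth / estimate)


def mean_q_error(estimates, truths):
    """Average the q-error over a batch of queries."""
    errors = [q_error(e, t) for e, t in zip(estimates, truths)]
    return sum(errors) / len(errors)


# Under- and over-estimating by a factor of two are penalized equally.
estimates = [50.0, 200.0, 100.0]
truths = [100.0, 100.0, 100.0]
loss = mean_q_error(estimates, truths)
```

The symmetry between under- and over-estimation is what distinguishes the q-error from an absolute or squared error on raw cardinalities.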

Key hyperparameters:

  • Hidden dimension $d = 256$,
  • Each MLP uses two layers,
  • Batch size 1024, trained for 100 epochs,
  • Materialized sample size $S = 1000$ tuples per table,
  • Training/validation split of 90,000 / 10,000 queries.

5. Integration of Sampling-Based Information

To leverage sampling strengths while mitigating zero-tuple failure modes, MSCN augments the table feature vector $v_t$ with either:

  • The scalar $s_t$ (number of qualifying samples among $S$ pre-materialized tuples),
  • The full $S$-bit bitmap $b_t$ (records which sampled tuples satisfy the predicates).

When $s_t = 0$ (the "0-tuple" situation), conventional sampling must fall back on independence assumptions or uniformity heuristics, often resulting in large errors. Because MSCN retains the set structure of the query and sees zero-sample patterns during training, it can learn correlations (e.g., highly selective conjunctions) that classical sampling alone cannot capture.
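The two sampling features can be derived from a materialized sample roughly as follows. This is a minimal sketch: the row layout, column names, and the `sample_bitmap` helper are hypothetical, and a four-row sample stands in for the $S$ tuples kept per table.

```python
import operator

# Supported comparison operators mapped to Python functions.
OPS = {"<": operator.lt, "=": operator.eq, ">": operator.gt}


def sample_bitmap(sample_rows, predicates):
    """Return one bit per sampled row: 1 if the row satisfies all predicates."""
    bitmap = []
    for row in sample_rows:
        qualifies = all(OPS[op](row[col], val) for col, op, val in predicates)
        bitmap.append(1 if qualifies else 0)
    return bitmap


# Hypothetical materialized sample of one base table.
sample = [
    {"production_year": 1995, "kind_id": 1},
    {"production_year": 2012, "kind_id": 1},
    {"production_year": 2015, "kind_id": 7},
    {"production_year": 2003, "kind_id": 7},
]

bitmap = sample_bitmap(sample, [("production_year", ">", 2010)])
count = sum(bitmap)  # the scalar feature: number of qualifying samples

# A selective conjunction that no sampled row satisfies: the 0-tuple case.
zero_bitmap = sample_bitmap(
    sample, [("production_year", ">", 2010), ("kind_id", "=", 3)]
)
```

In the 0-tuple case the bitmap carries no positive signal, so the estimate has to come from the other query features.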

6. Capturing Join-Crossing Correlations

Traditional histograms and sampling-based approaches generally assume independence across joins or have limited ability to handle absence of join-qualifying samples. MSCN's parallel aggregation of tables, joins, and predicates allows the final output MLP to integrate cross-signals and detect join-crossing correlations. For example, the model can associate patterns such as "French actors (Person.nationality=FR) are disproportionately in romantic movies," a dependency structure that is inaccessible to attribute-wise statistics and conventional index probing. This architecture enables the network to address limitations that "cripple all traditional estimators," particularly in multi-table, correlated settings.
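A toy calculation shows how an independence assumption can misestimate a correlated, join-crossing predicate pair. The data below is invented purely for illustration and does not come from IMDb or from the paper's experiments.

```python
# Hypothetical join result: one (actor nationality, movie genre) pair per row.
joined_rows = (
    [("FR", "romance")] * 8
    + [("FR", "action")] * 2
    + [("US", "romance")] * 10
    + [("US", "action")] * 80
)

total = len(joined_rows)
fr = sum(1 for nat, _ in joined_rows if nat == "FR")
romance = sum(1 for _, genre in joined_rows if genre == "romance")

# True cardinality of nationality = 'FR' AND genre = 'romance'.
true_count = sum(
    1 for nat, genre in joined_rows if nat == "FR" and genre == "romance"
)

# Independence assumption: multiply the two single-predicate selectivities.
independent_estimate = total * (fr / total) * (romance / total)

q_err = max(independent_estimate / true_count, true_count / independent_estimate)
```

In this toy data 8 rows qualify, while the independence-based estimate is roughly 1.8, an underestimate by a factor of more than four. A learned model can correct this kind of gap only if similar patterns appear in its training queries.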

7. Empirical Performance, Limitations, and Future Directions

Empirical benchmarks on the IMDb dataset (2.5M movies, 4M actors) included synthetic (5,000 queries, 0–2 joins), scale (500 queries, generalization to 3–4 joins), and JOB-light (70 real-world queries). On the synthetic workload using bitmaps:

  • Median q-error: 1.18 (compared to 1.69 for PostgreSQL, 1.89 for Random Sampling (RS), 1.09 for Index-Based Join Sampling (IBJS)),
  • 95th percentile q-error: 6.84 (vs. 23.9 / 53.4 / 33.2),
  • Mean q-error: 2.89 (vs. 154 / 125 / 118).

On highly selective "0-tuple" queries (376 queries), MSCN reached a median q-error of 2.94, outperforming RS (9.13) and PostgreSQL (4.78). On the scale workload, MSCN's 95th percentile q-error rose from 38.6 (3 joins) to 2397 (4 joins), still outperforming PostgreSQL. For JOB-light queries, the median q-error was 3.82 (PostgreSQL: 7.93, RS: 11.5, IBJS: 1.59) and the 95th percentile was 362 (vs. PostgreSQL: 1104, RS: 4073).

Model footprint remains under 3 MiB, a marked efficiency compared to full indexes as used by IBJS.

Limitations include:

  • Diminished generalization for queries diverging from the training join and predicate distributions,
  • Lack of support for complex predicates (e.g., LIKE, disjunctions), which require fallback to bitmap or histogram features,
  • Point-estimate predictions only (i.e., absence of uncertainty quantification),
  • Static snapshot assumption, so adapting to schema/data changes requires retraining or risk-prone online updates,
  • Restriction to numeric/string attribute encoding via hashing, needing enriched supervision.

Potential directions cited include uncertainty estimation via deep ensembles or dropout, adaptive query generation targeting high-error schema regions, and extending predicate-level bitmap support.

By design, MSCN combines set invariance and sampling signals in a compact neural model. This reduces both median and tail cardinality estimation errors, adds robustness in zero-sample regimes, and allows the model to recognize complex data correlations, positioning MSCN as a first successful step towards learned, memory-efficient cardinality estimation in relational database management systems (Kipf et al., 2018).
