CoLSE: Copula-based Learned Selectivity Estimator

Updated 21 December 2025
  • The paper introduces CoLSE, a hybrid model that integrates copula-based joint CDF estimation with a neural network for robust cardinality estimation.
  • It applies D-vine copula decomposition with PCHIP splines to achieve sub-millisecond inference and near–state-of-the-art plan-match accuracy.
  • Experimental results show CoLSE’s efficiency and compact design (<3 MB), making it viable for modern query optimizer pipelines in production environments.

CoLSE (Copula-based Learned Selectivity Estimator) is a hybrid learned model for single-table cardinality estimation (CE) that directly models the joint cumulative distribution function (CDF) of queried attributes via copula theory, and augments this with a compact neural network for error compensation. The method is designed to provide accurate, efficient, and memory-light selectivity estimation for conjunctive selection queries in relational databases. CoLSE achieves near–state-of-the-art accuracy with sub-millisecond inference, rapid training, and a model size under 3 MB, addressing practical requirements of modern query optimization pipelines (Rathuwadu et al., 14 Dec 2025).

1. The Cardinality Estimation Problem and Motivations

Cardinality estimation is the task of predicting the result size $|q(T)|$ of a query $q$ against a base relation $T$; selectivity is defined as $\mathrm{sel}(q) = |q(T)| / |T|$. Accurate CE is foundational for cost-based query optimizers, where cascading estimation errors can cause poor physical plan selection and drastically increased runtimes. State-of-the-art CE approaches generally fall into two categories: query-driven methods, which directly regress from query features to cardinality, and data-driven methods, which learn the data distribution to estimate $\mathbb{P}(\text{predicates})$. For real-world adoption, a CE method must efficiently trade off inference speed, training time, accuracy, and memory.
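As a concrete reference point, the definitions above can be evaluated exactly by scanning the table, which is what a learned estimator tries to approximate without the scan. The following is a minimal sketch; the table, attribute names, and query are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple

Row = Dict[str, float]
# A conjunctive range query: attribute -> (low, high), both bounds inclusive.
Query = Dict[str, Tuple[float, float]]


def cardinality(table: List[Row], query: Query) -> int:
    """Exact |q(T)|: count rows that satisfy every range predicate."""
    return sum(
        all(lo <= row[attr] <= hi for attr, (lo, hi) in query.items())
        for row in table
    )


def selectivity(table: List[Row], query: Query) -> float:
    """sel(q) = |q(T)| / |T|."""
    return cardinality(table, query) / len(table)


# Hypothetical toy table with two perfectly correlated attributes.
toy_table = [{"age": float(a), "income": 1000.0 * a} for a in range(20, 60)]
toy_query = {"age": (30, 39), "income": (25000, 35000)}

if __name__ == "__main__":
    print(cardinality(toy_table, toy_query), selectivity(toy_table, toy_query))
```

Because the two attributes are correlated, multiplying the per-attribute selectivities would misestimate this query, which is the kind of error that dependency-aware estimators aim to avoid.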

CoLSE was developed to overcome the inability of existing approaches to achieve an optimal balance among these metrics, particularly in scenarios requiring repeated sub-millisecond invocations and compact (few-megabyte) models suitable for deployment within primary DBMS executables (Rathuwadu et al., 14 Dec 2025).

2. Copula-based Joint CDF Estimation

CoLSE exploits the copula-theoretic decomposition to efficiently model the joint CDF of selection predicates, bypassing high-dimensional density estimation and the need for sampling. The marginal CDF for each attribute $A_k$ is numerically binned (typically $B \approx 5{,}000$ buckets) and fitted using monotonic PCHIP splines: $F_k(x) = \mathbb{P}(A_k \leq x)$. By Sklar’s theorem, the joint CDF over $n$ attributes is written as

$$\mathbb{P}(A_1 \leq x_1, \dots, A_n \leq x_n) = C(F_1(x_1), \dots, F_n(x_n)),$$

where $C$ is a copula encapsulating dependencies among marginals. CoLSE specifically uses a D-vine sequence of bivariate (pair-copula) functions with conditioning, reducing the parameter space to $O(n^2)$ Gumbel copula fit parameters (each parameter $\theta$ estimated using Kendall’s $\tau$).
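The marginal-fitting step described above (binning followed by a monotone spline) can be illustrated with a small pure-Python sketch. It builds equi-width bins and interpolates the binned empirical CDF with a monotone cubic Hermite interpolant in the Fritsch–Carlson (PCHIP) style. The function names are hypothetical, equi-width binning is an assumption, and the end-point slopes use a simple one-sided difference, so results can differ slightly from library PCHIP implementations.

```python
import bisect
from typing import Callable, List, Sequence, Tuple


def binned_cdf_knots(values: Sequence[float], num_bins: int) -> Tuple[List[float], List[float]]:
    """Equi-width bin edges and the empirical CDF P(A <= edge) at each edge."""
    data = sorted(values)
    lo, hi = data[0], data[-1]  # assumes at least two distinct values
    xs = [lo + (hi - lo) * i / num_bins for i in range(num_bins)] + [hi]
    ys = [bisect.bisect_right(data, x) / len(data) for x in xs]
    return xs, ys


def pchip_slopes(xs: Sequence[float], ys: Sequence[float]) -> List[float]:
    """Knot derivatives that keep the cubic Hermite interpolant monotone."""
    h = [b - a for a, b in zip(xs, xs[1:])]
    delta = [(ys[i + 1] - ys[i]) / h[i] for i in range(len(h))]
    d = [0.0] * len(xs)
    d[0], d[-1] = delta[0], delta[-1]  # simple one-sided end slopes (assumption)
    for i in range(1, len(xs) - 1):
        if delta[i - 1] * delta[i] > 0:  # flat neighbours keep slope 0
            w1 = 2 * h[i] + h[i - 1]
            w2 = h[i] + 2 * h[i - 1]
            # Weighted harmonic mean of the two adjacent secant slopes.
            d[i] = (w1 + w2) / (w1 / delta[i - 1] + w2 / delta[i])
    return d


def fit_marginal_cdf(values: Sequence[float], num_bins: int = 5000) -> Callable[[float], float]:
    """Return F(x) ~ P(A <= x) as a monotone spline over the binned CDF."""
    xs, ys = binned_cdf_knots(values, num_bins)
    d = pchip_slopes(xs, ys)

    def cdf(x: float) -> float:
        if x < xs[0]:
            return 0.0
        if x >= xs[-1]:
            return 1.0
        i = bisect.bisect_right(xs, x) - 1
        h = xs[i + 1] - xs[i]
        t = (x - xs[i]) / h
        # Cubic Hermite basis functions on [0, 1].
        h00 = (1 + 2 * t) * (1 - t) ** 2
        h10 = t * (1 - t) ** 2
        h01 = t * t * (3 - 2 * t)
        h11 = t * t * (t - 1)
        return h00 * ys[i] + h10 * h * d[i] + h01 * ys[i + 1] + h11 * h * d[i + 1]

    return cdf
```

Monotonicity matters here because the fitted $F_k$ is used as a CDF: a non-monotone interpolant could yield negative probability mass for a narrow range.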

The range selectivity for $n = 2$ is computed via inclusion–exclusion over copula increments; for higher $n$, recursive corner-based inclusion–exclusion and conditional copula estimation are employed. This design allows:

  • Deterministic, schema-order-based inference;
  • Closed-form CDF increment computation, avoiding Monte Carlo;
  • Efficient $O(n^2)$ parameter usage per query (Rathuwadu et al., 14 Dec 2025).
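To make the inclusion–exclusion step concrete, the sketch below evaluates the probability of a hyper-rectangle from any joint CDF by summing signed CDF values over its corners, and demonstrates it on a bivariate Gumbel copula. Inputs are already in CDF space, that is, each bound is $u_k = F_k(x_k)$. The function names and the value of `theta` are hypothetical, and the sketch does not reproduce the paper's D-vine recursion for more than two attributes.

```python
import math
from itertools import product
from typing import Callable, Sequence


def gumbel_cdf(u: float, v: float, theta: float) -> float:
    """Bivariate Gumbel copula C(u, v) for theta >= 1 (theta = 1 is independence)."""
    if u <= 0.0 or v <= 0.0:
        return 0.0
    a = (-math.log(u)) ** theta
    b = (-math.log(v)) ** theta
    return math.exp(-((a + b) ** (1.0 / theta)))


def box_probability(
    joint_cdf: Callable[[Sequence[float]], float],
    lows: Sequence[float],
    highs: Sequence[float],
) -> float:
    """P(lows < U <= highs) via inclusion-exclusion over the 2^n box corners."""
    n = len(lows)
    total = 0.0
    for picks in product((0, 1), repeat=n):  # 1 selects the upper bound
        corner = [highs[i] if p else lows[i] for i, p in enumerate(picks)]
        sign = (-1) ** (n - sum(picks))  # one sign flip per lower bound used
        total += sign * joint_cdf(corner)
    return max(0.0, total)  # clamp tiny negative values from rounding


theta = 3.0  # hypothetical dependence strength


def pair_cdf(uv: Sequence[float]) -> float:
    return gumbel_cdf(uv[0], uv[1], theta)


if __name__ == "__main__":
    # Selectivity of 0.2 < u <= 0.6 AND 0.2 < v <= 0.6 under positive dependence.
    print(box_probability(pair_cdf, [0.2, 0.2], [0.6, 0.6]))
```

Every term is a closed-form CDF evaluation, so no sampling is involved; the trade-off is that the number of corners grows as $2^n$ with the number of range predicates.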

3. Error Compensation Neural Network

While copula-based CDF estimation (JPE) captures much of the attribute dependency, systematic residual errors persist. CoLSE introduces an Error Compensation Network (ECN), a lightweight feedforward neural network (4 layers: 256→256→128→64 units, ReLU activations, 3 linear heads) that predicts (1) the log-magnitude of the residual, and (2) the probabilities that the copula estimate under- or over-estimates the ground truth.

The ECN operates on the following features: normalized marginal CDFs at predicate bounds, the copula selectivity estimate, and the classic attribute value independence (AVI) heuristic estimate. During inference, the compensated selectivity depends on the ECN’s confidence: when the model is confident, the copula estimate is corrected using the predicted residual; otherwise, a fallback estimate is used. Training minimizes an MSE+BCE loss over these outputs (Rathuwadu et al., 14 Dec 2025).
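Two pieces of this description lend themselves to a short illustration: the AVI baseline feature and the shape of a combined regression-plus-classification loss. The sketch below is written under explicit assumptions: the two loss terms are summed without weights, each direction probability gets its own binary cross-entropy term, and all function names are hypothetical. The source does not specify the exact loss weighting or head layout beyond "MSE+BCE".

```python
import math
from typing import Sequence


def avi_selectivity(per_predicate_sels: Sequence[float]) -> float:
    """AVI heuristic: multiply per-predicate selectivities as if independent."""
    return math.prod(per_predicate_sels)


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error over a batch."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def bce(prob: Sequence[float], label: Sequence[float], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy; probabilities are clamped away from 0 and 1."""
    total = 0.0
    for p, y in zip(prob, label):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(prob)


def ecn_loss(
    pred_log_mag: Sequence[float],
    true_log_mag: Sequence[float],
    pred_p_under: Sequence[float],
    is_under: Sequence[float],
    pred_p_over: Sequence[float],
    is_over: Sequence[float],
) -> float:
    """Assumed form: MSE on the log-magnitude head plus BCE on both direction heads."""
    return (
        mse(pred_log_mag, true_log_mag)
        + bce(pred_p_under, is_under)
        + bce(pred_p_over, is_over)
    )
```

Feeding the AVI estimate alongside the copula estimate gives the network a view of how far the dependency-aware estimate departs from the independence baseline.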

4. Training and Inference Workflow

CoLSE’s training pipeline is as follows:

  1. Marginal estimation: Read $T$ and fit spline marginals $F_k$ for each attribute.
  2. Copula fitting: Estimate all pairwise Gumbel copula parameters using Kendall’s $\tau$.
  3. Query workload: Generate or accept a set of training queries with exact selectivities.
  4. JPE calculation: Compute copula CDF-based selectivities for the training queries.
  5. ECN training: Train with Adam (25 epochs), using residuals as targets.
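Steps 2 and 5 of the pipeline above can be sketched in a few lines. For the Gumbel copula, Kendall's $\tau$ and the parameter $\theta$ are related by $\tau = 1 - 1/\theta$, so $\theta = 1/(1 - \tau)$. The residual targets below assume an additive residual (true selectivity minus copula estimate) and a small `eps` to keep the logarithm finite; both are assumptions for illustration, as are the function names, and the paper's exact target definition may differ.

```python
import math
from itertools import combinations
from typing import Dict, Sequence


def kendall_tau(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Kendall's tau-a by brute-force pair counting (O(m^2), fine for a sample)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    m = len(xs)
    return (concordant - discordant) / (m * (m - 1) / 2)


def gumbel_theta_from_tau(tau: float) -> float:
    """Invert tau = 1 - 1/theta. Gumbel requires theta >= 1, i.e. tau in [0, 1)."""
    tau = min(max(tau, 0.0), 0.999)  # clamp is an assumption of this sketch
    return 1.0 / (1.0 - tau)


def residual_targets(true_sel: float, jpe_sel: float, eps: float = 1e-9) -> Dict[str, float]:
    """ECN training targets from an exact selectivity and a copula estimate."""
    residual = true_sel - jpe_sel  # assumed additive residual
    return {
        "log_magnitude": math.log(abs(residual) + eps),
        "under": 1.0 if residual > 0 else 0.0,  # copula estimate too low
        "over": 1.0 if residual < 0 else 0.0,   # copula estimate too high
    }
```

Because the Gumbel family only models non-negative dependence, a negative sample $\tau$ is clamped to independence here; how the paper treats negatively correlated pairs is not stated in this summary.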

Inference for a new query is $O(n^2)$ in the number of predicates $n$, but because $n$ is small in practice, wall times are consistently below 1.5 ms on public datasets. The ECN’s inference time and memory overhead are negligible compared to DNN-based regressors (Rathuwadu et al., 14 Dec 2025).

<table>
<thead>
<tr><th>Workflow step</th><th>Details</th><th>Resource impact</th></tr>
</thead>
<tbody>
<tr><td>Marginal CDF fitting</td><td>PCHIP spline, ~2 MB</td><td>Memory, speed</td></tr>
<tr><td>Copula fitting</td><td>$O(n^2)$ Gumbel $\theta$, <0.1 MB</td><td>Accuracy, memory</td></tr>
<tr><td>ECN training</td><td>Parameters ≈0.8 MB</td><td>Accuracy, negligible latency</td></tr>
</tbody>
</table>

5. Experimental Evaluation and Benchmarks

The evaluation covers four real-world datasets (Census, Forest, Power, DMV) and synthetic data (correlated, skewed TPC-H lineitem). The main metrics are plan-matching accuracy (the fraction of test queries for which PostgreSQL chooses the same plan as under ground-truth cardinalities) and Q-error percentiles for the join extension. CoLSE achieves:

  • Plan-match rates of 95–96% (vs. DeepDB/Naru: 93–94%, MSCN: 94–95%);
  • 0.8–1.5 ms inference latency (vs. 2–10 ms for deep autoregressive models, 0.2–0.5 ms for query-driven methods);
  • Training time under 5 min on the full 11.6M-row DMV dataset (vs. 20–70 min for deep/data-driven CE);
  • Model size <3 MB (substantially smaller than Naru/DeepDB/MSCN/LW-XGB).

CoLSE is reported to remain robust under attribute correlation, heavy workload drift, and data updates, with bounded plan-match degradation both without retraining and after a marginal-only update (Rathuwadu et al., 14 Dec 2025).
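The two evaluation metrics named in this section are simple to compute once estimates and plans are available. The sketch below uses the standard Q-error definition, $\max(\hat{c}/c,\ c/\hat{c})$, a nearest-rank percentile, and a plan-match rate over plan identifiers. The `floor` of one row and the representation of plans as comparable identifiers are assumptions; obtaining the plans themselves (for example from PostgreSQL with injected cardinalities) is outside the sketch.

```python
import math
from typing import Hashable, Sequence


def q_error(estimate: float, truth: float, floor: float = 1.0) -> float:
    """Q-error = max(est/true, true/est), with both sides floored to avoid division by zero."""
    est = max(estimate, floor)
    true = max(truth, floor)
    return max(est / true, true / est)


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def plan_match_rate(plans_est: Sequence[Hashable], plans_true: Sequence[Hashable]) -> float:
    """Fraction of queries whose plan under estimates equals the plan under true cardinalities."""
    matches = sum(a == b for a, b in zip(plans_est, plans_true))
    return matches / len(plans_true)
```

Plan-match rate is the coarser but more operational metric: an estimate can have a large Q-error and still lead the optimizer to the same plan.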

6. Complexity, Model Size, and Operational Footprint

For a query with $n$ predicates, copula-based selectivity computation entails $O(n^2)$ pair-copula evaluations, a cost that is negligible because $n$ is small in practical single-table scenarios. Memory breakdown: 2 MB for marginals, <0.1 MB for copula parameters, 0.8 MB for ECN weights; total under 3 MB. In comparison, leading query-driven and deep models consume between 10 MB and 67 MB.
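The arithmetic behind these figures is easy to check. A D-vine over $n$ variables contains $n(n-1)/2$ pair-copulas, which is the source of the quadratic growth, and the three memory components quoted above sum to just under 3 MB. The sketch below uses the quoted upper bound of 0.1 MB for the copula parameters; the dictionary keys and function name are hypothetical.

```python
def dvine_pair_copulas(n: int) -> int:
    """Number of pair-copulas in a D-vine over n variables: n(n-1)/2."""
    return n * (n - 1) // 2


# Memory components in MB as quoted in the text (copula term is an upper bound).
memory_mb = {"marginals": 2.0, "copula_params": 0.1, "ecn_weights": 0.8}
total_mb = sum(memory_mb.values())

if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, dvine_pair_copulas(n))
    print(round(total_mb, 1))
```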

Training consists of marginal fitting (1–2 min), copula parameter estimation, and fast ECN training (on the order of 3 min for 80K queries). The full DMV dataset (11.6M rows) is handled in under 5 min (Rathuwadu et al., 14 Dec 2025).

7. Trade-offs, Limitations, and Future Directions

CoLSE’s primary limitation is quadratic scaling in the number of predicates due to the $O(n^2)$ pair-copula recursion, making it less suited to queries with unusually many predicates. It currently supports only conjunctions of range/equality predicates. Extensions under exploration include:

  • Generalization to arbitrary Boolean predicates;
  • Non-Archimedean or neural copula layers for enhanced dependency modeling;
  • Efficient vine recursion for large $n$ via low-rank or sparse structures.

The design demonstrates that joint CDF modeling via vine copulas, augmented by a neural residual, can deliver robust accuracy, low latency, and small model size suitable for production-grade cardinality estimation (Rathuwadu et al., 14 Dec 2025).
