
CoLSE: Copula-based Learned Selectivity Estimator

Updated 21 December 2025
  • The paper introduces CoLSE, a hybrid model that integrates copula-based joint CDF estimation with a neural network for robust cardinality estimation.
  • It applies a D-vine copula decomposition with PCHIP splines to achieve millisecond-scale inference and near-state-of-the-art plan-match accuracy.
  • Experimental results show CoLSE’s efficiency and compact design (<3 MB), making it viable for modern query optimizer pipelines in production environments.

CoLSE (Copula-based Learned Selectivity Estimator) is a hybrid learned model for single-table cardinality estimation (CE) that directly models the joint cumulative distribution function (CDF) of queried attributes via copula theory, and augments this with a compact neural network for error compensation. The method is designed to provide accurate, efficient, and memory-light selectivity estimation for conjunctive selection queries in relational databases. CoLSE achieves near-state-of-the-art accuracy with millisecond-scale inference (0.8–1.5 ms), rapid training, and a model size under 3 MB, addressing practical requirements of modern query optimization pipelines (Rathuwadu et al., 14 Dec 2025).

1. The Cardinality Estimation Problem and Motivations

Cardinality estimation is the task of predicting the result size $|q(T)|$ of a query $q$ against a base relation $T$; selectivity is defined as $\mathrm{sel}(q) = |q(T)| / |T|$. Accurate CE is foundational for cost-based query optimizers, where cascading estimation errors can cause poor physical plan selection and drastically increased runtimes. State-of-the-art CE approaches generally fall into two categories: query-driven methods, which directly regress from query features to cardinality, and data-driven methods, which learn the data distribution to estimate $\mathbb{P}(\text{predicates})$. For real-world adoption, a CE method must efficiently trade off inference speed, training time, accuracy, and memory.

CoLSE was developed to overcome the inability of existing approaches to achieve an optimal balance among these metrics, particularly in scenarios requiring repeated sub-millisecond invocations and compact (few-megabyte) models suitable for deployment within primary DBMS executables (Rathuwadu et al., 14 Dec 2025).

2. Copula-based Joint CDF Estimation

CoLSE exploits the copula-theoretic decomposition to efficiently model the joint CDF of selection predicates, bypassing high-dimensional density estimation and the need for sampling. The marginal CDF for each attribute $A_k$ is numerically binned (typically $B \approx 5{,}000$ buckets) and fitted using monotonic PCHIP splines: $F_k(x) = \mathbb{P}(A_k \leq x)$. By Sklar's theorem, the joint CDF over $n$ attributes is written as

$$\mathbb{P}(A_1 \leq x_1, \ldots, A_n \leq x_n) = C(F_1(x_1), \ldots, F_n(x_n)),$$

where $C$ is a copula encapsulating dependencies among marginals. CoLSE specifically uses a D-vine sequence of bivariate (pair-copula) functions with conditioning, reducing the parameter space to $O(n^2)$ Gumbel copula fit parameters (each parameter $\theta$ estimated from Kendall's $\tau$).
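To make the fitting step concrete, the following Python sketch (not from the paper; the $\theta = 1/(1-\tau)$ Gumbel relation and the Gumbel CDF formula are standard, but the function names and exact procedure are assumptions) fits a monotone PCHIP spline to one attribute's empirical CDF and estimates a bivariate Gumbel copula parameter from Kendall's $\tau$:

```python
import numpy as np
from scipy.interpolate import PchipInterpolator
from scipy.stats import kendalltau

def fit_marginal_cdf(col: np.ndarray, n_bins: int = 5000):
    """Fit a monotone PCHIP spline F_k(x) ~= P(A_k <= x) on quantile knots."""
    qs = np.linspace(0.0, 1.0, n_bins + 1)
    xs = np.quantile(col, qs)                    # knot positions in value space
    xs, idx = np.unique(xs, return_index=True)   # PCHIP needs strictly increasing x
    return PchipInterpolator(xs, qs[idx])

def fit_gumbel_theta(a: np.ndarray, b: np.ndarray) -> float:
    """Gumbel parameter via Kendall's tau: tau = 1 - 1/theta => theta = 1/(1 - tau)."""
    tau, _ = kendalltau(a, b)
    tau = min(max(tau, 0.0), 0.99)               # Gumbel captures non-negative dependence
    return 1.0 / (1.0 - tau)

def gumbel_cdf(u: float, v: float, theta: float) -> float:
    """Bivariate Gumbel copula C(u,v) = exp(-[(-ln u)^th + (-ln v)^th]^(1/th))."""
    u = float(np.clip(u, 1e-12, 1.0))
    v = float(np.clip(v, 1e-12, 1.0))
    return float(np.exp(-((-np.log(u)) ** theta + (-np.log(v)) ** theta) ** (1.0 / theta)))
```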

The range selectivity for $n=2$ is computed via inclusion–exclusion over copula increments (a sketch follows the list below); for higher $n$, recursive corner-based inclusion–exclusion and conditional copula estimation are employed. This design allows:

  • Deterministic, schema-order-based inference;
  • Closed-form CDF increment computation, avoiding Monte Carlo;
  • Efficient $O(n)$ parameter usage per query (Rathuwadu et al., 14 Dec 2025).
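Continuing the sketch above (reusing its hypothetical `gumbel_cdf` and fitted marginals), the $n=2$ rectangle probability is a closed-form inclusion–exclusion over the four corner CDF values:

```python
def range_selectivity_2d(F1, F2, lo1, hi1, lo2, hi2, theta):
    """P(lo1 < A1 <= hi1, lo2 < A2 <= hi2) via inclusion-exclusion on the
    copula CDF evaluated at the four rectangle corners."""
    u_lo, u_hi = F1(lo1), F1(hi1)   # marginal CDFs at the predicate bounds
    v_lo, v_hi = F2(lo2), F2(hi2)
    return (gumbel_cdf(u_hi, v_hi, theta) - gumbel_cdf(u_lo, v_hi, theta)
            - gumbel_cdf(u_hi, v_lo, theta) + gumbel_cdf(u_lo, v_lo, theta))
```

No sampling is involved: four copula evaluations suffice, which is what makes the increment computation closed-form.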

3. Error Compensation Neural Network

While copula-based CDF estimation (JPE) captures much of the attribute dependency, systematic residual errors persist. CoLSE introduces an Error Compensation Network (ECN), a lightweight feedforward neural network (4 layers: $256 \rightarrow 256 \rightarrow 128 \rightarrow 64$ units, ReLU activations, 3 linear heads) that predicts (1) the log-magnitude of the residual, $\log|\mathrm{sel}_{\mathrm{true}} - s_0|$, and (2) the probabilities that the copula estimate $s_0$ under- or overestimates the ground truth.

The ECN operates on features comprising the normalized marginal CDFs at the predicate bounds, the copula selectivity estimate, and the classic attribute-value-independence (AVI) heuristic. During inference, if the model is confident ($\max(P^+, P^-) > 0.5$), the compensated selectivity is evaluated as $s = s_0 + \mathrm{sign}(P^+ - P^-)\exp(\hat{r})$; otherwise, $s = s_0$. Training minimizes a combined MSE+BCE loss over these outputs (Rathuwadu et al., 14 Dec 2025).
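A hedged PyTorch sketch of the described architecture and gating rule follows; the use of two independent sigmoid direction heads and the exact head wiring are assumptions, since the summary only specifies the layer widths and the three outputs:

```python
import math
import torch
import torch.nn as nn

class ECN(nn.Module):
    """Error Compensation Network: 256 -> 256 -> 128 -> 64 trunk with ReLU,
    plus three linear heads (residual log-magnitude, under-/over-estimation logits)."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        self.r_head = nn.Linear(64, 1)   # predicts log|sel_true - s0|
        self.under = nn.Linear(64, 1)    # logit of P+ (s0 underestimates)
        self.over = nn.Linear(64, 1)     # logit of P- (s0 overestimates)

    def forward(self, x):
        h = self.trunk(x)
        return self.r_head(h), self.under(h), self.over(h)

def compensate(s0: float, r_hat: float, p_plus: float, p_minus: float) -> float:
    """Gated correction: s = s0 + sign(P+ - P-) * exp(r_hat) if confident, else s0."""
    if max(p_plus, p_minus) > 0.5:
        return s0 + math.copysign(math.exp(r_hat), p_plus - p_minus)
    return s0
```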

4. Training and Inference Workflow

CoLSE’s training pipeline is as follows:

  1. Marginal estimation: Read $T$ and fit spline marginals $F_k$ for each attribute.
  2. Copula fitting: Estimate all pairwise Gumbel copula parameters using Kendall's $\tau$.
  3. Query workload: Generate or accept a set $Q$ of training queries with exact selectivities.
  4. JPE calculation: Compute copula CDF-based selectivities $s_0(q)$.
  5. ECN training: Train with Adam ($10^{-3}$ learning rate, 25 epochs), using residuals as targets.

Inference for a new query is $O(n^2)$ in the number of predicates, but since $n \ll 10$ in practice, wall times are consistently below 1.5 ms on public datasets. The ECN's inference time and memory overhead are negligible compared to DNN-based regressors (Rathuwadu et al., 14 Dec 2025).
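Under the same assumptions, the MSE+BCE objective from step 5 could be trained roughly as follows, reusing the ECN sketch above (full-batch updates and unit loss weights are assumptions; `feats`, `r_true`, `y_under`, and `y_over` are hypothetical pre-computed tensors):

```python
import torch
import torch.nn as nn

def train_ecn(model, feats, r_true, y_under, y_over, epochs=25, lr=1e-3):
    """Train the ECN sketch above: MSE on the log-residual magnitude,
    BCE on the two direction heads, Adam at the reported learning rate."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mse, bce = nn.MSELoss(), nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        r_hat, u_logit, o_logit = model(feats)
        loss = mse(r_hat, r_true) + bce(u_logit, y_under) + bce(o_logit, y_over)
        loss.backward()
        opt.step()
    return model
```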

| Workflow step | Details | Resource impact |
|---|---|---|
| Marginal CDF fitting | PCHIP splines, ~2 MB | Memory, speed |
| Copula fitting | $O(n^2)$ Gumbel $\theta$ parameters, <0.1 MB | Accuracy, memory |
| ECN training | ≈0.8 MB of parameters | Accuracy, negligible latency |

5. Experimental Evaluation and Benchmarks

The evaluation covers four real-world datasets (Census, Forest, Power, DMV) and synthetic data (correlated, skewed TPC-H lineitem). The main metrics are plan-matching accuracy (the fraction of test queries for which PostgreSQL chooses the same plan as under ground-truth cardinalities) and Q-error percentiles for the join extension. CoLSE achieves:

  • Plan-match rates of 95–96% (vs. 93–94% for DeepDB/Naru and 94–95% for MSCN);
  • 0.8–1.5 ms inference latency (vs. 2–10 ms for deep autoregressive models and 0.2–0.5 ms for query-driven models);
  • Training time under 5 min for $>10^7$ rows (vs. 20–70 min for deep/data-driven CE);
  • Model size <3 MB (substantially smaller than Naru/DeepDB/MSCN/LW-XGB).

CoLSE shows robustness under attribute correlation, heavy workload drift, and data updates, with plan-match degradation always $<2\%$ without retraining (and $<0.5\%$ after a marginal-only update) (Rathuwadu et al., 14 Dec 2025).

6. Complexity, Model Size, and Operational Footprint

For a query with $n$ predicates, copula-based selectivity computation entails $O(n^2)$ pair-copula evaluations, a cost that is negligible since $n \ll 10$ in practical single-table scenarios. The memory breakdown is 2 MB for marginals, <0.1 MB for copula parameters, and 0.8 MB for ECN weights, for a total under 3 MB. In comparison, leading query-driven and deep models consume between 10 MB and 67 MB.

Training consists of marginal fitting (1–2 min), copula parameter estimation, and fast ECN training ($\leq 3$ min for 80K queries). The full DMV dataset (11.6M rows) is handled in under 5 min (Rathuwadu et al., 14 Dec 2025).

7. Trade-offs, Limitations, and Future Directions

CoLSE's primary limitation is quadratic scaling in the number of predicates due to the $O(n^2)$ pair-copula recursion, making it less suitable for unusually high-dimensional CE queries. It currently supports only conjunctions of range/equality predicates. Extensions under exploration include:

  • Generalization to arbitrary Boolean predicates;
  • Non-Archimedean or neural copula layers for enhanced dependency modeling;
  • Efficient vine recursion for large $n$ via low-rank or sparse structures.

The design demonstrates that joint CDF modeling via vine copulas, augmented by a neural residual, can deliver robust accuracy, low latency, and small model size suitable for production-grade cardinality estimation (Rathuwadu et al., 14 Dec 2025).

References

Rathuwadu et al. "CoLSE: Copula-based Learned Selectivity Estimator." 14 Dec 2025.
