Pessimistic Cardinality Estimation
- Pessimistic cardinality estimation is a framework that computes guaranteed upper bounds on multi-way join outputs using LP optimization and entropy constraints.
- It leverages ℓp-norm degree statistics to derive tight, one-sided estimates that enhance resource provisioning and robust query plan selection.
- Practical implementations such as SafeBound and LpBound demonstrate sub-millisecond inference and improved robustness by avoiding underestimation of query result sizes.
Pessimistic cardinality estimation is a formal framework for computing guaranteed upper bounds on the output size of database queries, especially multi-way joins, using precomputed statistics extracted from the input relations. In contrast to traditional (optimistic or unbiased) estimators—which frequently rely on heuristic assumptions of attribute independence or uniformity and may significantly under- or over-estimate the actual cardinality—pessimistic approaches guarantee that, for any database instance consistent with the maintained statistics, the estimate never underestimates the true result size. This one-sided safety property makes pessimistic methods particularly suitable for resource provisioning, query plan selection, and robust avoidance of out-of-memory or high-cost execution scenarios in relational database systems.
1. Theoretical Foundation: Entropy and Information-Theoretic Bounds
Pessimistic cardinality estimation formulates the join size estimation task as a constrained optimization problem on the space of random variables induced by join-output tuples. Let $Q$ denote a full conjunctive query over a database $D$. Viewing the answer set $Q(D)$ as a uniform distribution over its tuples, the Shannon entropy of that distribution equals the logarithm of $|Q(D)|$:

$$h(\text{all query variables}) \;=\; \log |Q(D)|.$$

Upper bounds on $|Q(D)|$ thus emerge as upper bounds on joint entropy, subject to both classical Shannon inequalities (monotonicity and submodularity, i.e., polymatroid constraints) and empirical data statistics distilled into information inequalities.
The central insight, formalized by Abo Khamis, Nakos, Olteanu, and Suciu, is that each available $\ell_p$-norm statistic on a degree sequence of join attributes in the input relations yields a constraint of the form

$$\frac{1}{p}\,h(X) \;+\; h(Y \mid X) \;\le\; \log \big\lVert \deg_R(Y \mid X) \big\rVert_p,$$

where $h$ is the entropy surrogate variable, $X$ indexes the conditioning attributes, $Y$ indexes the joined attributes, and $\lVert \deg_R(Y \mid X) \rVert_p$ is the $\ell_p$-norm of the degree sequence of $R$ obtained by grouping by $X$ and counting $Y$-extensions (Khamis et al., 5 Mar 2025, Zhang et al., 9 Feb 2025). The tightest possible upper bound on $|Q(D)|$ consistent with the statistics is then

$$|Q(D)| \;\le\; 2^{\mathrm{LP}^{*}},$$

where $\mathrm{LP}^{*}$ is the optimal value of a linear program maximizing the joint entropy $h(\text{all query variables})$ under the above entropy and statistical constraints.
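As a sanity check on the $\ell_p$-norm constraint family, its two extreme instantiations recover the classical statistics (the $\ell_1$-norm of a degree sequence is the relation's cardinality; the $\ell_\infty$-norm is its maximum degree):

```latex
% p = 1: the conditional term combines with the full h(X) term, and the
% l_1 norm is |R|, recovering the AGM-style cardinality constraint
h(X) + h(Y \mid X) \;=\; h(XY) \;\le\; \log |R|
% p -> infinity: the (1/p) h(X) term vanishes and the norm is the max-degree
h(Y \mid X) \;\le\; \log \max_{x} \deg_R(Y \mid X = x)
```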
2. Degree Sequences and Norm-Based Constraints
Input statistics are summarized as degree sequences. For each relation $R$ and disjoint attribute sets $X, Y$, the degree sequence $\deg_R(Y \mid X)$ gives the number of $Y$-tuples associated with each $X$-value in $R$. The family of derived $\ell_p$-norms,

$$\big\lVert \deg_R(Y \mid X) \big\rVert_p \;=\; \Big( \sum_{x} \deg_R(Y \mid X = x)^p \Big)^{1/p},$$

interpolates between total table size ($p = 1$), maximum degree ($p = \infty$), and frequency-moment bounds for intermediate $p$ (Zhang et al., 9 Feb 2025, Khamis et al., 2024). Using higher-order moments enables finer-grained quantification of data skew and yields strictly tighter bounds than using only cardinalities or max-degrees.
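As an illustration, here is a minimal sketch (over a small hypothetical relation) of extracting a degree sequence and evaluating its $\ell_p$-norms, showing the interpolation between table size and max-degree:

```python
from collections import Counter

# hypothetical relation R(x, y) as a list of tuples
R = [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), (3, 'a'), (3, 'b')]

# degree sequence of y conditioned on x: number of y-values per x-value
deg = list(Counter(x for x, _ in R).values())   # [3, 1, 2]

def lp_norm(seq, p):
    """l_p norm of a degree sequence; p = inf gives the maximum degree."""
    if p == float('inf'):
        return max(seq)
    return sum(d ** p for d in seq) ** (1.0 / p)

print(lp_norm(deg, 1))             # 6.0 -> |R|, total table size
print(lp_norm(deg, float('inf')))  # 3   -> maximum degree
print(round(lp_norm(deg, 2), 3))   # sqrt(14) ~ 3.742, a frequency moment
```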
Each degree statistic gives rise to an entropy constraint and, by LP duality, to an explicit "q-inequality": an upper bound on $|Q(D)|$ expressed as a product of norm values raised to combinatorially computable weights.
3. Linear Programming Frameworks and Combinatorial Algorithms
Pessimistic cardinality bounds arise as the optima of linear programs over entropy variables (one per subset of join attributes) constrained by polymatroid inequalities. For acyclic (Berge-acyclic) queries and "simple" (single-column) statistics, the LP can be greatly reduced, or replaced by max-flow computations or bottom-up message-passing schemes that run in polynomial time (Zhang et al., 9 Feb 2025, Deeds et al., 2022, Deeds et al., 2022). For general conjunctive queries, including cyclic and group-by cases, the full LP (or alternative combinatorial shortest-path formulations) is required (Chen et al., 2021).
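To make the LP concrete, here is a minimal sketch for the two-relation chain query $Q(x,y,z) = R(x,y) \wedge S(y,z)$: it enumerates monotonicity and submodularity constraints over the three query variables and adds hypothetical statistics ($|R| = |S| = 100$, max degree 5 for $S$ on $y$). The encoding and the statistics are illustrative assumptions, not any system's actual implementation; `scipy` is assumed available:

```python
from itertools import combinations
from math import log2
from scipy.optimize import linprog

V = "xyz"
subsets = [frozenset(c) for r in range(1, len(V) + 1) for c in combinations(V, r)]
idx = {s: i for i, s in enumerate(subsets)}
n = len(subsets)

def vec(*terms):
    # build a coefficient row from (sign, subset) pairs; h(emptyset) = 0
    row = [0.0] * n
    for sign, s in terms:
        s = frozenset(s)
        if s:
            row[idx[s]] += sign
    return row

A_ub, b_ub = [], []
# monotonicity: h(B \ {v}) <= h(B)
for B in subsets:
    for v in B:
        A_ub.append(vec((+1, B - {v}), (-1, B))); b_ub.append(0.0)
# submodularity: h(A) + h(B) >= h(A | B) + h(A & B)
for A in subsets:
    for B in subsets:
        A_ub.append(vec((+1, A | B), (+1, A & B), (-1, A), (-1, B)))
        b_ub.append(0.0)
# hypothetical statistics: |R(x,y)| = 100, |S(y,z)| = 100,
# and an l_infinity statistic h(z | y) <= log2(max_y deg_S(z | y)) = log2(5)
A_ub.append(vec((+1, "xy"))); b_ub.append(log2(100))
A_ub.append(vec((+1, "yz"))); b_ub.append(log2(100))
A_ub.append(vec((+1, "yz"), (-1, "y"))); b_ub.append(log2(5))

# maximize h(xyz) <=> minimize -h(xyz)
res = linprog(vec((-1, "xyz")), A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * n)
print(round(2 ** -res.fun))  # 500 = |R| * max-degree, vs. the naive |R|*|S| = 10000
```

The max-degree statistic lets the LP certify the bound $|R| \cdot \max\text{-deg} = 500$ via $h(xyz) \le h(xy) + h(z \mid y)$, far below the cross-product bound.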
A representative worked example for the two-way join $Q(x,y,z) = R(x,y) \wedge S(y,z)$, using $\ell_2$-norm degree statistics $\lVert \deg_R(x \mid y) \rVert_2$ and $\lVert \deg_S(z \mid y) \rVert_2$, gives

$$|Q| \;=\; \sum_{y} \deg_R(x \mid y)\,\deg_S(z \mid y) \;\le\; \big\lVert \deg_R(x \mid y) \big\rVert_2 \cdot \big\lVert \deg_S(z \mid y) \big\rVert_2$$

by the Cauchy–Schwarz inequality, with this LP-derived bound being typically sharper than heuristic cardinality estimates based only on average degrees (Khamis et al., 5 Mar 2025, Khamis et al., 2024).
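The $\ell_2$-norm product bound for a two-way join (a direct Cauchy–Schwarz consequence) is easy to check numerically; a small sketch on hypothetical relations:

```python
from collections import Counter
from math import sqrt

# hypothetical relations R(x, y) and S(y, z), joined on y
R = [(0, 1), (1, 1), (2, 1), (3, 2), (4, 3)]
S = [(1, 'a'), (1, 'b'), (2, 'a'), (4, 'a')]

dR = Counter(y for _, y in R)   # degree of each y-value in R
dS = Counter(y for y, _ in S)   # degree of each y-value in S
join_size = sum(dR[y] * dS[y] for y in dR)
l2_bound = (sqrt(sum(d * d for d in dR.values()))
            * sqrt(sum(d * d for d in dS.values())))
print(join_size, round(l2_bound, 2))  # 7 8.12 -- the bound safely covers the truth
```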
4. Advanced Bounds: Ambidextrous and Degree-Sequence Techniques
Recent advances extend the $\ell_p$-norm framework to richer, bivariate statistics. Ambidextrous bounds use mixed moments of degree sequences taken in both directions of a relation to derive entropy inequalities that subsume and tighten the classical "claw" $\ell_p$-norm bounds (Lin et al., 5 Oct 2025). These tighter entropy inequalities exploit both directions of data skew, yielding upper bounds that are empirically 2–5× tighter on real-world graphs and multi-way joins.
The degree-sequence bound (DSB) computes the join size upper bound exactly for acyclic queries by aligning degree sequences across relations, maximizing output under worst-case correlation. Compression, piecewise-constant approximations, and fast bottom-up dynamic programming allow application of DSB in production systems (Deeds et al., 2022, Deeds et al., 2022).
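For a single join attribute, the worst-case-correlation idea behind DSB can be sketched by pairing the sorted degree sequences: by the rearrangement inequality, no database with these degree sequences can produce a larger join. A simplified illustration (the full DSB algorithm for acyclic queries is more involved):

```python
from itertools import zip_longest

# hypothetical degree sequences of the shared join attribute in R and S
deg_R = sorted([3, 1, 1], reverse=True)
deg_S = sorted([2, 1, 1], reverse=True)

# worst-case correlation aligns the heaviest hitters with each other
dsb = sum(a * b for a, b in zip_longest(deg_R, deg_S, fillvalue=0))
print(dsb)  # 8 -- tighter than the l2 product bound sqrt(11)*sqrt(6) ~ 8.12
```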
5. System Implementations and Practical Impact
Systems such as SafeBound (Deeds et al., 2022) and LpBound (Zhang et al., 9 Feb 2025) instantiate pessimistic estimation at system scale, providing:
- Stateless per-relation statistics: piecewise-constant or histogrammed degree sequences; a handful of $\ell_p$-norms; top-$k$ most-common values for selection predicates.
- Fast, sub-millisecond inference through LP simplifications, message-passing, or compressed functional representations.
- Predicate support by decomposing degree statistics on filtered data slices: MCV-value lists, range histograms, and 3-gram decompositions for LIKE predicates.
- Empirically one-sided, order-of-magnitude tighter estimates than classical system or ML-based estimators on realistic workloads (e.g., JOB, STATS, DBLP subgraph matching). As observed, LpBound yields q-errors close to $1$ (versus errors orders of magnitude larger in mainstream systems), and, when injected into PostgreSQL, avoids optimizer misestimates and matches or improves on the plan quality derived from true cardinalities (Zhang et al., 9 Feb 2025, Deeds et al., 2022).
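For reference, the q-error metric quoted above measures multiplicative deviation in either direction; a minimal definition:

```python
def q_error(estimate: float, truth: float) -> float:
    """Multiplicative error, always >= 1; equals 1 for a perfect estimate."""
    return max(estimate / truth, truth / estimate)

# a pessimistic estimator only errs on the high side, so its q-error
# is simply estimate / truth
print(q_error(500, 100), q_error(100, 500))  # 5.0 5.0
```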
6. Comparison to Traditional and Learned Cardinality Estimators
Traditional estimators (histogram-based, independence-assuming) are fast and composable but lack any error guarantees, frequently underestimating query result sizes, especially when strong correlations, skewed degree distributions, or complex predicates are present (Khamis et al., 2024). Learned models (BayesCard, NeuroCard, DeepDB, FactorJoin, FLAT) may offer competitive mean error on in-sample queries but lack robustness guarantees, require significant training data, and cannot rule out underestimation (Zhang et al., 9 Feb 2025).
Pessimistic estimators, conversely, guarantee no underestimation, compose well (for certain acyclic queries), and support complex Boolean predicate structures through minima and sums over norms. They typically require only local statistics, with a modest memory footprint (a few MB per dataset), and enable efficient plan pruning in cost-based optimizers by bounding the cost of subplans and avoiding pathological plan selection (Khamis et al., 2024, Deeds et al., 2022).
7. Limitations and Future Directions
The primary limitations of current pessimistic cardinality estimation frameworks include:
- Exponential scaling of the general LP in the number of query variables, though acyclic and many practical cyclic cases admit polynomial-time reductions (Zhang et al., 9 Feb 2025, Deeds et al., 2022).
- Looser bounds in highly skewed or cyclic queries unless richer (e.g., joint) statistics are collected.
- Maintenance cost and complexity of storing full degree sequences; compressed or histogrammed surrogates mitigate this, but can weaken the bounds (Deeds et al., 2022).
- Orthogonality to the complementary problem of cardinality lower bounds: xBound addresses the dual task of hard lower bounding, which pessimistic upper-bound frameworks do not cover (Stoian et al., 19 Jan 2026).
- Open directions include tighter integration of ambidextrous and multivariate statistics into systems, extension of the lower-bound side to multi-key and cyclic queries, and continuous maintenance of statistics under high-frequency data modification.
Pessimistic cardinality estimation thus provides a rigorous, information-theoretic approach to robust query optimization, with provable safety margins, theoretical tightness in worst-case scenarios, and practical relevance as demonstrated by adoption into multiple open-source database backends and commercial query planners (Khamis et al., 5 Mar 2025, Zhang et al., 9 Feb 2025, Deeds et al., 2022).