Cardinality Estimation Pruning (CEP)

Updated 2 December 2025
  • CEP is an unlearning framework that eliminates the impact of deleted records in multi-table cardinality estimators by pruning sensitive parameters and supports.
  • It employs distribution-sensitivity scores and Fisher information to identify the model components most affected by data deletions, effectively addressing inter-table dependencies.
  • CEP incorporates domain pruning to remove input supports for vanished attribute values, ensuring accurate selectivity estimates and compliance with data regulations.

Cardinality Estimation Pruning (CEP) is an unlearning framework targeting the removal of deleted-data influence from learned multi-table cardinality estimation (CE) models. CEP is specifically designed to handle the unique distributional dependencies of multi-table relational data in systems such as NeuroCard and FACE, enabling efficient and accurate model adaptation to data deletions without resorting to full retraining. This approach addresses core challenges in machine unlearning for CE, including attribute-level sensitivity, inter-table propagation, and domain disappearance, which are critical for regulatory compliance (e.g., GDPR/CCPA) and data integrity in database management contexts (He et al., 25 Nov 2025).

1. Motivation and Problem Setting

Learned cardinality estimators must continuously adapt their selectivity estimates to reflect data deletions, ensuring that the learned models do not retain influence from expunged records. In multi-table settings, three central challenges arise:

  • Attribute-level sensitivity: Deletions may entirely remove rare attribute values, exposing the estimator to severe distributional shifts.
  • Inter-table propagation: Due to foreign-key joins, deletions in one table can cause cascading effects across multi-way joins, altering joint distributions non-locally.
  • Domain disappearance: When specific attribute values are completely eliminated, failure to reallocate probability mass leads to dramatic overestimation of join cardinalities.

CEP is constructed to address these challenges by systematically pruning both parameters and input supports relevant to the deleted data, achieving efficient unlearning aligned with distributional changes.

2. Distribution Sensitivity Pruning

Distribution Sensitivity Pruning is the first core component of CEP, isolating and pruning model parameters highly sensitive to deleted records. This is accomplished via distribution-aware sensitivity metrics and the use of diagnostic join samples.

Attribute Sensitivity Scores

Given attribute $A^i$ with original and post-deletion empirical pmfs $P^i$ and $P_r^i$, respectively, sensitivity is quantified as:

$$S_i = \frac{\left|P^i - P_r^i\right|}{P_r^i}$$

This ratio emphasizes rare or substantially altered values, assigning them higher weight in the subsequent loss computations.
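
As a concrete illustration, the NumPy sketch below computes per-value sensitivity scores for a single attribute from its before- and after-deletion histograms. The function name, the `eps` smoothing, and the per-value reading of $S_i$ are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def attribute_sensitivity(original_counts: np.ndarray,
                          retained_counts: np.ndarray,
                          eps: float = 1e-9) -> np.ndarray:
    """Per-value sensitivity |P^i - P_r^i| / P_r^i for a single attribute.

    original_counts / retained_counts: value histograms over the attribute's
    domain before and after deletion, in the same value order.
    """
    p = original_counts / original_counts.sum()               # empirical pmf P^i
    p_r = retained_counts / max(retained_counts.sum(), 1.0)   # post-deletion pmf P_r^i
    return np.abs(p - p_r) / (p_r + eps)                      # rare / heavily shifted values score high

# Example: the first value's share drops from 10% to ~1%, so it receives a far larger score.
S = attribute_sensitivity(np.array([10.0, 80.0, 10.0]), np.array([1.0, 80.0, 10.0]))
```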

Pruning-Aware Loss Functions

  • Autoregressive models (e.g., NeuroCard) weight each conditional log-likelihood term by its attribute sensitivity:

$$\tilde{L}(x) = -\sum_{i=1}^{D} S_i \cdot \log p(A^i \mid A^{<i})$$

  • Normalizing flows (e.g., FACE) apply a sample-level sensitivity sum as a multiplicative weight to the loss:

$$\tilde{L}(x) = L(x) \cdot \left(\sum_{i \in \mathrm{Cols}(x)} S_i\right)$$
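
A minimal PyTorch sketch of the two weighting schemes follows; the tensor layouts and function names are illustrative assumptions and do not reproduce the NeuroCard or FACE training code.

```python
import torch

def autoregressive_weighted_loss(log_conditionals: torch.Tensor,
                                 sensitivities: torch.Tensor) -> torch.Tensor:
    """Sensitivity-weighted NLL for an autoregressive estimator (NeuroCard-style).

    log_conditionals: [batch, D] tensor of log p(A^i | A^{<i}) per attribute.
    sensitivities:    [batch, D] tensor of S_i evaluated at each sample's values.
    """
    return -(sensitivities * log_conditionals).sum(dim=1).mean()

def flow_weighted_loss(nll: torch.Tensor, sensitivities: torch.Tensor) -> torch.Tensor:
    """Sample-level weighting for a normalizing-flow estimator (FACE-style).

    nll:           [batch] tensor of per-sample negative log-likelihoods L(x).
    sensitivities: [batch, D]; the per-sample weight is the sum over its columns.
    """
    return (nll * sensitivities.sum(dim=1)).mean()
```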

Fisher-Diagonal Importance and Semi-Join Sampling

Fisher information is diagonally approximated on the deleted data $D_d$ to measure each parameter $\theta$'s contribution:

$$I(\tilde{L}, D_d) = \mathbb{E}_{x \in D_d} \left[ \left( \frac{\partial \tilde{L}(x)}{\partial \theta} \right)^2 \right]$$

Semi-join deletion results are constructed for each table $T^k$:

$$J_d^k = T^1 \bowtie \cdots \bowtie T_d^k \bowtie \cdots \bowtie T^n$$

where $T_d^k$ represents the deleted tuples. Sampling from $J_d^k$ captures the widespread distributional impact of the deletion.
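
The sketch below shows one way to accumulate the diagonal Fisher importance over minibatches drawn from the semi-join deletion view $J_d^k$; the model and loss interfaces are assumed placeholders rather than the paper's code.

```python
import torch

def fisher_diagonal(model: torch.nn.Module, weighted_loss_fn, semi_join_batches):
    """Accumulate the diagonal Fisher importance E[(dL~/dtheta)^2] over J_d^k samples."""
    importance = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
    n_batches = 0
    for batch in semi_join_batches:              # minibatches sampled from the semi-join view
        model.zero_grad()
        loss = weighted_loss_fn(model, batch)    # sensitivity-weighted loss L~(x)
        loss.backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                importance[name] += p.grad.detach() ** 2   # squared gradient = diagonal Fisher term
        n_batches += 1
    return {name: v / max(n_batches, 1) for name, v in importance.items()}
```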

Parameter Pruning Procedure

A total pruning budget $\alpha$ is evenly allocated across tables. Iterative magnitude pruning zeroes the $\alpha_k$ most important parameters per table, as ranked by Fisher importance, ensuring that the removed parameters are those most responsible for modeling the deleted data.
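
A simplified, one-shot version of this pruning step is sketched below; CEP allocates the budget per table and prunes iteratively, whereas this illustrative helper prunes a single model component in one pass.

```python
import torch

def prune_top_fraction(model: torch.nn.Module, importance: dict, alpha_k: float) -> None:
    """Zero the alpha_k fraction of parameters with the highest Fisher importance."""
    scores = torch.cat([importance[name].flatten() for name, _ in model.named_parameters()])
    k = int(alpha_k * scores.numel())
    if k == 0:
        return
    threshold = torch.topk(scores, k).values.min()      # importance cutoff for pruning
    with torch.no_grad():
        for name, p in model.named_parameters():
            p[importance[name] >= threshold] = 0.0      # remove deleted-data-specific weights
```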

Algorithm Summary

| Step | Description | Output |
| --- | --- | --- |
| Compute $S_i$ | Attribute-wise sensitivity | Sensitivity weights |
| Sample from $J_d^k$ | Semi-join view sampling | Distribution-shifted minibatches |
| Accumulate $I_k$ | Fisher-diagonal importance per table | Parameter importance vector |
| Prune top $\alpha_k$ entries | Magnitude-based zeroing | Deleted-data-agnostic parameters |

The overall time complexity is

$$O\left(K N_s (C + P) + K P \log P\right)$$

where $P$ is the parameter count, $C$ the per-batch compute cost, $K$ the number of tables, and $N_s$ the number of sampling steps. In practice, $N_s(C+P) \gg P \log P$, so the sampling-and-gradient term dominates.

3. Domain Pruning

Domain Pruning directly removes input support for attribute values completely eliminated from the remaining dataset, resolving the overestimation issue caused by probability mass being assigned to vanished domains.

Detection and Removal

  • Categorical Attributes: The embedding matrix $E^i \in \mathbb{R}^{d \times h}$ is pruned by removing columns corresponding to disappeared attribute values $\mathrm{Dom}(A_d^i) = \mathrm{Dom}(A^i) \setminus \mathrm{Dom}(A_r^i)$, yielding $E^i_{\text{pruned}} = E^i[:, \mathrm{Dom}(A_r^i)]$.
  • Numerical Attributes: The original range $[l, h]$ is restricted to the retained intervals $\{[a_j, b_j]\}$, and the mapping is adjusted so that out-of-support queries are clamped to the valid subspace, preventing spurious responses on gaps.

This operation eliminates any remnants of deleted attribute values from both model input and parametrization.
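
A hedged NumPy sketch of both domain-pruning cases follows; the embedding layout and the interval-intersection clamping scheme are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def prune_categorical_embedding(E: np.ndarray, domain: list, retained_values: set) -> np.ndarray:
    """Keep only the embedding columns for values that still occur, Dom(A_r^i)."""
    keep = [j for j, v in enumerate(domain) if v in retained_values]
    return E[:, keep]                                    # E_pruned = E[:, Dom(A_r^i)]

def clamp_range_to_support(lo: float, hi: float, retained_intervals: list) -> list:
    """Intersect a query range with the retained sub-intervals [a_j, b_j].

    Portions of the range that fall into vanished gaps contribute nothing,
    preventing overestimation on deleted numeric sub-domains.
    """
    clamped = []
    for a, b in retained_intervals:
        l, h = max(lo, a), min(hi, b)
        if l <= h:
            clamped.append((l, h))                       # keep only the in-support overlap
    return clamped
```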

4. Complete CEP Workflow

The CEP algorithm proceeds as follows:

  1. Compute retained attribute histograms and sensitivity scores $S_i$.
  2. Execute Distribution Sensitivity Pruning to obtain updated parameters $\theta^u$.
  3. Apply Domain Pruning to excise vanished value supports from input features.
  4. Fine-tune the pruned model briefly on the retained dataset $D_r$ to restore selectivity estimation quality.

Executing these steps ensures that dependencies and supports on deleted data are excised prior to any retraining, minimizing the risk of lingering influence and enabling efficient convergence (He et al., 25 Nov 2025).
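
At a high level, the workflow can be pictured as the orchestration sketch below, where each stage is supplied as a callable; the function signature is purely illustrative and not the authors' API.

```python
from typing import Callable

def cep_unlearn(model,
                compute_sensitivities: Callable[[], dict],           # step 1: histograms and S_i
                sensitivity_prune: Callable[[object, dict], None],   # step 2: distribution sensitivity pruning
                domain_prune: Callable[[object], None],              # step 3: excise vanished value supports
                fine_tune: Callable[[object], None]):                # step 4: brief fine-tuning on D_r
    """High-level CEP workflow with each stage injected as a callable."""
    sensitivities = compute_sensitivities()
    sensitivity_prune(model, sensitivities)   # yields the pruned parameters theta^u
    domain_prune(model)
    fine_tune(model)
    return model
```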

5. Experimental Evaluation

CEP was evaluated using two state-of-the-art multi-table CE architectures:

  • NeuroCard (autoregressive)
  • FACE (normalizing flow)

across the IMDB (6 tables, JOB-light workload) and TPC-H (4 tables) datasets. Baselines included no-adaptation (Stale), full Retrain, and light Fine-Tune. Deletion scenarios encompassed both attribute-targeted (A-$x$-$\rho$) and random (R-$x$-$\rho$) deletions, parameterized by the number of affected tables $x$ and the deletion ratio $\rho$.

Performance was assessed using Q-error percentiles (50th through 99th) on both original (OQ) and complement (CQ) queries.
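
Q-error here follows the standard definition: the ratio between estimated and true cardinality, taken in whichever direction exceeds one. A minimal helper:

```python
def q_error(estimated: float, true: float, eps: float = 1.0) -> float:
    """Standard Q-error: max(est/true, true/est); >= 1, lower is better."""
    est, tru = max(estimated, eps), max(true, eps)
    return max(est / tru, tru / est)

# e.g. q_error(50, 200) == 4.0 -- a 4x under-estimate and a 4x over-estimate score the same.
```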

Key findings:

| Condition | Q-Error (CEP) | Q-Error (Retrain) | Q-Error (Fine-Tune) |
| --- | --- | --- | --- |
| JOB-light (A-1, OQ 50th) | 1.21 | 1.43 | Failure (Q99 ≈ 5142) |
| NeuroCard (A-6, Q99) | 4.84 | 21.84 | 4168 |
| FACE (A-6, Q99) | 24.70 | 41,100 | -- |

CEP achieved lower or comparable Q-error relative to full retraining, especially at high deletion ratios, and converged in fewer epochs. Pruning required only 0.3%–2.5% of the fine-tuning cost.

6. Ablations, Insights, and Limitations

Ablation studies revealed that:

  • Domain Pruning alone (CEP-D) yields large tail errors (e.g., NeuroCard Q99 ≈ 2155), confirming that support removal without parameter pruning is insufficient.
  • Sensitivity Pruning alone (CEP-S) offers moderate improvements but cannot prevent overestimation when domains disappear.
  • Only the full combination achieves minimum Q-error across regimes. Incorporating Domain Pruning into baselines (FT+D, Retrain+D) substantially reduces high quantile errors, indicating its critical role.

CEP currently addresses deletions; extending it to insertions or updates would necessitate new sensitivity metrics and potentially dynamic subspace pruning. Integration into full query optimizers and end-to-end workload evaluation also remain open.

A plausible implication is that CEP’s sparsification effect, occasionally outstripping retraining, may be explained by mechanisms analogous to the lottery ticket hypothesis. CEP stands as the first targeted unlearning solution tailored for distribution shifts in multi-table CE, establishing its utility for data deletion compliance and efficient model maintenance in database systems (He et al., 25 Nov 2025).
