Cardinality Estimation Pruning (CEP)
- CEP is an unlearning framework that eliminates the impact of deleted records in multi-table cardinality estimators by pruning sensitive parameters and supports.
- It employs distribution-sensitivity scores and Fisher information to identify the model components most affected by data deletions, effectively addressing inter-table dependencies.
- CEP incorporates domain pruning to remove input supports for vanished attribute values, ensuring accurate selectivity estimates and compliance with data regulations.
Cardinality Estimation Pruning (CEP) is an unlearning framework targeting the removal of deleted-data influence from learned multi-table cardinality estimation (CE) models. CEP is specifically designed to handle the unique distributional dependencies of multi-table relational data in systems such as NeuroCard and FACE, enabling efficient and accurate model adaptation to data deletions without resorting to full retraining. This approach addresses core challenges in machine unlearning for CE, including attribute-level sensitivity, inter-table propagation, and domain disappearance, which are critical for regulatory compliance (e.g., GDPR/CCPA) and data integrity in database management contexts (He et al., 25 Nov 2025).
1. Motivation and Problem Setting
Learned cardinality estimators must continuously adapt their selectivity estimates to reflect data deletions, ensuring that the learned models do not retain influence from expunged records. In multi-table settings, three central challenges arise:
- Attribute-level sensitivity: Deletions may entirely remove rare attribute values, exposing the estimator to severe distributional shifts.
- Inter-table propagation: Due to foreign-key joins, deletions in one table can cause cascading effects across multi-way joins, altering joint distributions non-locally.
- Domain disappearance: When specific attribute values are completely eliminated, failure to reallocate probability mass leads to dramatic overestimation of join cardinalities.
CEP is constructed to address these challenges by systematically pruning both parameters and input supports relevant to the deleted data, achieving efficient unlearning aligned with distributional changes.
2. Distribution Sensitivity Pruning
Distribution Sensitivity Pruning is the first core component of CEP, isolating and pruning model parameters highly sensitive to deleted records. This is accomplished via distribution-aware sensitivity metrics and the use of diagnostic join samples.
Attribute Sensitivity Scores
Given attribute $A$ with original and post-deletion empirical pmfs $P_A$ and $\tilde{P}_A$, respectively, sensitivity is quantified by a ratio of the form

$$s_A(v) = \frac{|P_A(v) - \tilde{P}_A(v)|}{P_A(v) + \epsilon},$$

with $\epsilon$ a small smoothing constant. This ratio emphasizes rare or substantially altered values, imposing higher weight in subsequent loss computations.
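As a concrete illustration, the sketch below computes per-value sensitivity scores from the original and post-deletion histograms; the exact ratio and the smoothing constant `eps` are assumptions for illustration rather than the paper's published definition.

```python
import numpy as np

def attribute_sensitivity(orig_counts: np.ndarray, post_counts: np.ndarray,
                          eps: float = 1e-6) -> np.ndarray:
    """Per-value sensitivity: change in probability mass relative to the
    original mass, so rare or strongly shifted values get high weight.
    (Illustrative form; the paper's exact ratio may differ.)"""
    p = orig_counts / max(orig_counts.sum(), 1)   # original pmf P_A
    q = post_counts / max(post_counts.sum(), 1)   # post-deletion pmf
    return np.abs(p - q) / (p + eps)

# Example: the rare value (index 2) is fully deleted and gets the largest score
orig = np.array([500, 300, 5, 195])
post = np.array([500, 300, 0, 195])
print(attribute_sensitivity(orig, post))
```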
Pruning-Aware Loss Functions
- Autoregressive models (e.g., NeuroCard) receive a reweighted negative log-likelihood in which each attribute's conditional term is scaled by that attribute's sensitivity score.
- Normalizing flows (e.g., FACE) apply the per-sample sum of attribute sensitivities as a multiplicative weight on the sample's loss (both weightings are sketched below).
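A minimal sketch of the two weighting schemes, assuming the estimator exposes per-attribute and per-sample negative log-likelihoods; the tensor shapes and function names are illustrative placeholders, not the published implementation.

```python
import torch

def weighted_ar_loss(per_attr_nll: torch.Tensor, sens: torch.Tensor) -> torch.Tensor:
    """NeuroCard-style autoregressive weighting: each attribute's conditional NLL
    term is scaled by its sensitivity weight before summing over attributes and
    averaging over the batch. Shapes: [batch, num_attrs]."""
    return (sens * per_attr_nll).sum(dim=1).mean()

def weighted_flow_loss(per_sample_nll: torch.Tensor, sens: torch.Tensor) -> torch.Tensor:
    """FACE-style flow weighting: the per-sample sum of attribute sensitivities
    multiplies the whole sample's NLL. per_sample_nll: [batch], sens: [batch, num_attrs]."""
    return (sens.sum(dim=1) * per_sample_nll).mean()

# Toy usage with random stand-ins for the estimator's per-tuple losses
nll = torch.rand(8, 4)   # 8 tuples, 4 attributes
s = torch.rand(8, 4)     # hypothetical per-attribute sensitivity weights
print(weighted_ar_loss(nll, s).item(), weighted_flow_loss(nll.sum(dim=1), s).item())
```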
Fisher-Diagonal Importance and Semi-Join Sampling
Fisher information is diagonally approximated on deleted data to measure each parameter $\theta_k$'s contribution:

$$F_k = \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{del}}}\!\left[\left(\frac{\partial \log p_\theta(x)}{\partial \theta_k}\right)^{2}\right]$$
Semi-join deletion views are constructed for each table $T_i$ as $T_i \ltimes \Delta$, where $\Delta$ represents the deleted tuples. Sampling from these views captures the widespread distributional impact of the deletion.
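The sketch below accumulates a diagonal Fisher approximation from minibatches of such samples; the semi-join views would be materialized separately (e.g., via a SQL semi-join) and streamed in as `deleted_batches`, and `model.log_prob` is an assumed interface rather than NeuroCard's or FACE's actual API.

```python
import torch

def fisher_diagonal(model: torch.nn.Module, deleted_batches) -> dict:
    """Diagonal Fisher approximation: average squared gradients of the negative
    log-likelihood over minibatches drawn from the semi-join views of the
    deleted tuples. `model.log_prob(batch) -> [batch]` is an assumed interface."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for batch in deleted_batches:
        model.zero_grad()
        nll = -model.log_prob(batch).mean()
        nll.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: f / max(n_batches, 1) for n, f in fisher.items()}

# Minimal usage with a toy Gaussian density standing in for NeuroCard/FACE
class ToyDensity(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.zeros(3))
    def log_prob(self, x):
        return -0.5 * ((x - self.mu) ** 2).sum(dim=1)

model = ToyDensity()
batches = [torch.randn(16, 3) for _ in range(4)]   # stand-in semi-join minibatches
print({n: f.shape for n, f in fisher_diagonal(model, batches).items()})
```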
Parameter Pruning Procedure
A total pruning budget $B$ is evenly allocated across the $T$ tables. Iterative magnitude pruning zeroes the $B/T$ most important parameters per table, as ranked by Fisher importance, ensuring that the removed parameters are those most responsible for modeling the deleted data.
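A simplified one-shot version of this step is sketched below; the procedure described above allocates the budget per table and prunes iteratively, whereas this illustration applies a single global importance threshold.

```python
import torch

def prune_by_fisher(model: torch.nn.Module, fisher: dict, budget: int) -> None:
    """Zero the `budget` parameters with the highest Fisher importance
    (per table this would be roughly budget / num_tables). `fisher` maps
    parameter names to importance tensors of matching shape."""
    if budget <= 0:
        return
    scores = torch.cat([f.flatten() for f in fisher.values()])
    k = min(budget, scores.numel())
    threshold = torch.topk(scores, k).values.min()
    with torch.no_grad():
        for name, param in model.named_parameters():
            mask = fisher[name] >= threshold      # most deletion-relevant weights
            param[mask] = 0.0

# Toy usage: prune the two most "important" weights of a small linear layer
lin = torch.nn.Linear(4, 2)
imp = {n: torch.rand_like(p) for n, p in lin.named_parameters()}
prune_by_fisher(lin, imp, budget=2)
```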
Algorithm Summary
| Step | Description | Output |
|---|---|---|
| Compute $s_A$ | Attribute-wise sensitivity | Sensitivity weights |
| Sample from $T_i \ltimes \Delta$ | Semi-join view sampling | Distribution-shifted minibatches |
| Accumulate $F_k$ | Fisher diagonal per table | Parameter importance vector |
| Prune top-$B/T$ entries | Magnitude-based zeroing | Deleted-data-agnostic parameters |
The overall time complexity is $O(T \cdot S \cdot C + |\theta|)$, where $|\theta|$ is the parameter count, $C$ the per-batch compute cost, $T$ the number of tables, and $S$ the number of sampling steps. In practice, the sampling/Fisher-accumulation term $T \cdot S \cdot C$ dominates.
3. Domain Pruning
Domain Pruning directly removes input support for attribute values completely eliminated from the remaining dataset, resolving the overestimation issue caused by probability mass being assigned to vanished domains.
Detection and Removal
- Categorical Attributes: The embedding matrix $E_A$ is pruned by removing the columns corresponding to disappeared attribute values, yielding a reduced matrix $E_A'$.
- Numerical Attributes: The original value range is restricted to the retained intervals, and the input mapping is adjusted so that out-of-support queries are clamped to the valid subspace, preventing spurious responses on gaps.
This operation eliminates any remnants of deleted attribute values from both model input and parametrization.
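A hedged sketch of both operations follows, assuming a standard `torch.nn.Embedding` layout for categorical values and a simple interval clamp for numeric predicates; the actual integration with NeuroCard/FACE internals may differ.

```python
import torch

def prune_embedding(emb: torch.nn.Embedding, vanished_ids: set):
    """Drop the embedding entries of attribute values that no longer appear and
    return the reduced table plus an old-id -> new-id remapping."""
    keep = [i for i in range(emb.num_embeddings) if i not in vanished_ids]
    new_emb = torch.nn.Embedding(len(keep), emb.embedding_dim)
    with torch.no_grad():
        new_emb.weight.copy_(emb.weight[keep])
    return new_emb, {old: new for new, old in enumerate(keep)}

def clamp_numeric(lo: float, hi: float, retained_lo: float, retained_hi: float):
    """Restrict a numeric range predicate to the retained value interval so that
    queries over vanished gaps contribute no probability mass."""
    return max(lo, retained_lo), min(hi, retained_hi)

# Toy usage: categorical value id 2 disappeared from the remaining data
emb, remap = prune_embedding(torch.nn.Embedding(4, 8), {2})
print(emb.num_embeddings, remap, clamp_numeric(0.0, 100.0, 10.0, 60.0))
```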
4. Complete CEP Workflow
The CEP algorithm proceeds as follows:
- Compute retained attribute histograms $\tilde{P}_A$ and sensitivity scores $s_A$.
- Execute Distribution Sensitivity Pruning to obtain updated (pruned) parameters $\theta'$.
- Apply Domain Pruning to excise vanished value supports from input features.
- Fine-tune the pruned model briefly on the retained dataset to restore selectivity estimation quality.
Executing these steps ensures that dependencies and supports on deleted data are excised prior to any retraining, minimizing the risk of lingering influence and enabling efficient convergence (He et al., 25 Nov 2025).
5. Experimental Evaluation
CEP was evaluated using two state-of-the-art multi-table CE architectures:
- NeuroCard (autoregressive)
- FACE (normalizing flow)
across the IMDB (6 tables, JOB-light workload) and TPC-H (4 tables) datasets. Baselines included no-adaptation (Stale), full Retrain, and light Fine-Tune. Deletion scenarios encompassed both attribute-targeted (A-$k$) and random deletions, parameterized by the number of affected tables $k$ and the deletion ratio.
Performance was assessed using Q-error percentiles (50th through 99th) on both original (OQ) and complement (CQ) queries.
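For reference, the Q-error metric itself can be computed as follows; the arrays here are made-up values for illustration, not results from the paper.

```python
import numpy as np

def q_error(estimates: np.ndarray, true_cards: np.ndarray) -> np.ndarray:
    """Q-error = max(est/true, true/est) per query; 1.0 is a perfect estimate."""
    est = np.maximum(estimates, 1.0)      # guard against zero cardinalities
    true = np.maximum(true_cards, 1.0)
    return np.maximum(est / true, true / est)

# Percentile summary of the kind reported in the evaluation (50th-99th)
errs = q_error(np.array([120.0, 8.0, 3000.0]), np.array([100.0, 10.0, 50.0]))
print(np.percentile(errs, [50, 95, 99]))
```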
Key findings:
| Condition | Q-Error (CEP) | Q-Error (Retrain) | Q-Error (Fine-Tune) |
|---|---|---|---|
| JOB-light (A-1, OQ 50th) | 1.21 | 1.43 | Failure (Q99 ≈ 5142) |
| NeuroCard (A-6, Q99) | 4.84 | 21.84 | 4168 |
| FACE (A-6, Q99) | 24.70 | 41,100 | -- |
CEP achieved lower or comparable Q-error relative to full retraining, especially at high deletion ratios, and converged in fewer epochs. Pruning required only 0.3%–2.5% of the fine-tuning cost.
6. Ablations, Insights, and Limitations
Ablation studies revealed that:
- Domain Pruning alone (CEP-D) yields large tail errors (e.g., NeuroCard Q99≈2155), confirming that support removal without parameter pruning is insufficient.
- Sensitivity Pruning alone (CEP-S) offers moderate improvements but cannot prevent overestimation when domains disappear.
- Only the full combination achieves minimum Q-error across regimes. Incorporating Domain Pruning into baselines (FT+D, Retrain+D) substantially reduces high quantile errors, indicating its critical role.
CEP currently addresses deletions only; extending it to insertions or updates would require new sensitivity metrics and potentially dynamic subspace pruning. Integration into full query optimizers and end-to-end workload evaluation remain open directions.
A plausible implication is that CEP’s sparsification effect, occasionally outstripping retraining, may be explained by mechanisms analogous to the lottery ticket hypothesis. CEP stands as the first targeted unlearning solution tailored for distribution shifts in multi-table CE, establishing its utility for data deletion compliance and efficient model maintenance in database systems (He et al., 25 Nov 2025).