Cardinality Estimation Pruning (CEP)
- CEP is an unlearning framework that eliminates the impact of deleted records in multi-table cardinality estimators by pruning sensitive parameters and supports.
- It employs distribution-sensitivity scores and Fisher information to identify the model components most affected by data deletions, effectively addressing inter-table dependencies.
- CEP incorporates domain pruning to remove input supports for vanished attribute values, ensuring accurate selectivity estimates and compliance with data regulations.
Cardinality Estimation Pruning (CEP) is an unlearning framework targeting the removal of deleted-data influence from learned multi-table cardinality estimation (CE) models. CEP is specifically designed to handle the unique distributional dependencies of multi-table relational data in systems such as NeuroCard and FACE, enabling efficient and accurate model adaptation to data deletions without resorting to full retraining. This approach addresses core challenges in machine unlearning for CE, including attribute-level sensitivity, inter-table propagation, and domain disappearance, which are critical for regulatory compliance (e.g., GDPR/CCPA) and data integrity in database management contexts (He et al., 25 Nov 2025).
1. Motivation and Problem Setting
Learned cardinality estimators must continuously adapt their selectivity estimates to reflect data deletions, ensuring that the learned models do not retain influence from expunged records. In multi-table settings, three central challenges arise:
- Attribute-level sensitivity: Deletions may entirely remove rare attribute values, exposing the estimator to severe distributional shifts.
- Inter-table propagation: Due to foreign-key joins, deletions in one table can cause cascading effects across multi-way joins, altering joint distributions non-locally.
- Domain disappearance: When specific attribute values are completely eliminated, failure to reallocate probability mass leads to dramatic overestimation of join cardinalities.
CEP is constructed to address these challenges by systematically pruning both parameters and input supports relevant to the deleted data, achieving efficient unlearning aligned with distributional changes.
2. Distribution Sensitivity Pruning
Distribution Sensitivity Pruning is the first core component of CEP, isolating and pruning model parameters highly sensitive to deleted records. This is accomplished via distribution-aware sensitivity metrics and the use of diagnostic join samples.
Attribute Sensitivity Scores
Given attribute $A$ with original and post-deletion empirical pmfs $P_A$ and $\tilde{P}_A$, respectively, sensitivity is quantified by a ratio of the form

$$s_A(v) = \frac{|P_A(v) - \tilde{P}_A(v)|}{P_A(v) + \epsilon},$$

with $\epsilon$ a small smoothing constant. This ratio emphasizes rare or substantially altered values, imposing higher weight in subsequent loss computations.
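As a concrete illustration, the sketch below computes per-value sensitivity scores from the original and post-deletion histograms; the exact ratio and the smoothing constant `eps` are assumptions for illustration rather than the paper's published definition.

```python
import numpy as np

def attribute_sensitivity(orig_counts: np.ndarray, post_counts: np.ndarray,
                          eps: float = 1e-6) -> np.ndarray:
    """Per-value sensitivity: change in probability mass relative to the
    original mass, so rare or strongly shifted values get high weight.
    (Illustrative form; the paper's exact ratio may differ.)"""
    p = orig_counts / max(orig_counts.sum(), 1)   # original pmf P_A
    q = post_counts / max(post_counts.sum(), 1)   # post-deletion pmf
    return np.abs(p - q) / (p + eps)

# Example: the rare value (index 2) is fully deleted and gets the largest score
orig = np.array([500, 300, 5, 195])
post = np.array([500, 300, 0, 195])
print(attribute_sensitivity(orig, post))
```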
Pruning-Aware Loss Functions
- Autoregressive models (e.g., NeuroCard) receive a reweighted negative log-likelihood in which each attribute's conditional term is scaled by that attribute's sensitivity score.
- Normalizing flows (e.g., FACE) apply the per-sample sum of attribute sensitivities as a multiplicative weight on the sample's loss (both weightings are sketched below).
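A minimal sketch of the two weighting schemes, assuming the estimator exposes per-attribute and per-sample negative log-likelihoods; the tensor shapes and function names are illustrative placeholders, not the published implementation.

```python
import torch

def weighted_ar_loss(per_attr_nll: torch.Tensor, sens: torch.Tensor) -> torch.Tensor:
    """NeuroCard-style autoregressive weighting: each attribute's conditional NLL
    term is scaled by its sensitivity weight before summing over attributes and
    averaging over the batch. Shapes: [batch, num_attrs]."""
    return (sens * per_attr_nll).sum(dim=1).mean()

def weighted_flow_loss(per_sample_nll: torch.Tensor, sens: torch.Tensor) -> torch.Tensor:
    """FACE-style flow weighting: the per-sample sum of attribute sensitivities
    multiplies the whole sample's NLL. per_sample_nll: [batch], sens: [batch, num_attrs]."""
    return (sens.sum(dim=1) * per_sample_nll).mean()

# Toy usage with random stand-ins for the estimator's per-tuple losses
nll = torch.rand(8, 4)   # 8 tuples, 4 attributes
s = torch.rand(8, 4)     # hypothetical per-attribute sensitivity weights
print(weighted_ar_loss(nll, s).item(), weighted_flow_loss(nll.sum(dim=1), s).item())
```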
Fisher-Diagonal Importance and Semi-Join Sampling
Fisher information is diagonally approximated on deleted data to measure each parameter $\theta_k$'s contribution:

$$F_k = \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{del}}}\!\left[\left(\frac{\partial \log p_\theta(x)}{\partial \theta_k}\right)^{2}\right]$$
Semi-join deletion views are constructed for each table $T_i$ as $T_i \ltimes \Delta$, where $\Delta$ represents the deleted tuples. Sampling from these views captures the widespread distributional impact of the deletion.
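The sketch below accumulates a diagonal Fisher approximation from minibatches of such samples; the semi-join views would be materialized separately (e.g., via a SQL semi-join) and streamed in as `deleted_batches`, and `model.log_prob` is an assumed interface rather than NeuroCard's or FACE's actual API.

```python
import torch

def fisher_diagonal(model: torch.nn.Module, deleted_batches) -> dict:
    """Diagonal Fisher approximation: average squared gradients of the negative
    log-likelihood over minibatches drawn from the semi-join views of the
    deleted tuples. `model.log_prob(batch) -> [batch]` is an assumed interface."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for batch in deleted_batches:
        model.zero_grad()
        nll = -model.log_prob(batch).mean()
        nll.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: f / max(n_batches, 1) for n, f in fisher.items()}

# Minimal usage with a toy Gaussian density standing in for NeuroCard/FACE
class ToyDensity(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.zeros(3))
    def log_prob(self, x):
        return -0.5 * ((x - self.mu) ** 2).sum(dim=1)

model = ToyDensity()
batches = [torch.randn(16, 3) for _ in range(4)]   # stand-in semi-join minibatches
print({n: f.shape for n, f in fisher_diagonal(model, batches).items()})
```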
Parameter Pruning Procedure
A total pruning budget $B$ is evenly allocated across the $T$ tables. Iterative magnitude pruning zeroes the $B/T$ most important parameters per table, as ranked by Fisher importance, ensuring that the removed parameters are those most responsible for modeling the deleted data.
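A simplified one-shot version of this step is sketched below; the procedure described above allocates the budget per table and prunes iteratively, whereas this illustration applies a single global importance threshold.

```python
import torch

def prune_by_fisher(model: torch.nn.Module, fisher: dict, budget: int) -> None:
    """Zero the `budget` parameters with the highest Fisher importance
    (per table this would be roughly budget / num_tables). `fisher` maps
    parameter names to importance tensors of matching shape."""
    if budget <= 0:
        return
    scores = torch.cat([f.flatten() for f in fisher.values()])
    k = min(budget, scores.numel())
    threshold = torch.topk(scores, k).values.min()
    with torch.no_grad():
        for name, param in model.named_parameters():
            mask = fisher[name] >= threshold      # most deletion-relevant weights
            param[mask] = 0.0

# Toy usage: prune the two most "important" weights of a small linear layer
lin = torch.nn.Linear(4, 2)
imp = {n: torch.rand_like(p) for n, p in lin.named_parameters()}
prune_by_fisher(lin, imp, budget=2)
```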
Algorithm Summary
| Step | Description | Output |
|---|---|---|
| Compute $s_A$ | Attribute-wise sensitivity | Sensitivity weights |
| Sample from $T_i \ltimes \Delta$ | Semi-join view sampling | Distribution-shifted minibatches |
| Accumulate $F_k$ | Fisher diagonal per table | Parameter importance vector |
| Prune top-$B/T$ entries | Magnitude-based zeroing | Deleted-data-agnostic parameters |
The overall time complexity is $O(T \cdot S \cdot C + |\theta|)$, where $|\theta|$ is the parameter count, $C$ the per-batch compute cost, $T$ the number of tables, and $S$ the number of sampling steps. In practice, the sampling/Fisher-accumulation term $T \cdot S \cdot C$ dominates.
3. Domain Pruning
Domain Pruning directly removes input support for attribute values completely eliminated from the remaining dataset, resolving the overestimation issue caused by probability mass being assigned to vanished domains.
Detection and Removal
- Categorical Attributes: The embedding matrix $E_A$ is pruned by removing the columns corresponding to disappeared attribute values, yielding a reduced matrix $E_A'$.
- Numerical Attributes: The original value range is restricted to the retained intervals, and the input mapping is adjusted so that out-of-support queries are clamped to the valid subspace, preventing spurious responses on gaps.
This operation eliminates any remnants of deleted attribute values from both model input and parametrization.
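A hedged sketch of both operations follows, assuming a standard `torch.nn.Embedding` layout for categorical values and a simple interval clamp for numeric predicates; the actual integration with NeuroCard/FACE internals may differ.

```python
import torch

def prune_embedding(emb: torch.nn.Embedding, vanished_ids: set):
    """Drop the embedding entries of attribute values that no longer appear and
    return the reduced table plus an old-id -> new-id remapping."""
    keep = [i for i in range(emb.num_embeddings) if i not in vanished_ids]
    new_emb = torch.nn.Embedding(len(keep), emb.embedding_dim)
    with torch.no_grad():
        new_emb.weight.copy_(emb.weight[keep])
    return new_emb, {old: new for new, old in enumerate(keep)}

def clamp_numeric(lo: float, hi: float, retained_lo: float, retained_hi: float):
    """Restrict a numeric range predicate to the retained value interval so that
    queries over vanished gaps contribute no probability mass."""
    return max(lo, retained_lo), min(hi, retained_hi)

# Toy usage: categorical value id 2 disappeared from the remaining data
emb, remap = prune_embedding(torch.nn.Embedding(4, 8), {2})
print(emb.num_embeddings, remap, clamp_numeric(0.0, 100.0, 10.0, 60.0))
```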
4. Complete CEP Workflow
The CEP algorithm proceeds as follows:
- Compute retained attribute histograms $\tilde{P}_A$ and sensitivity scores $s_A$.
- Execute Distribution Sensitivity Pruning to obtain updated (pruned) parameters $\theta'$.
- Apply Domain Pruning to excise vanished value supports from input features.
- Fine-tune the pruned model briefly on the retained dataset to restore selectivity estimation quality.
Executing these steps ensures that dependencies and supports on deleted data are excised prior to any retraining, minimizing the risk of lingering influence and enabling efficient convergence (He et al., 25 Nov 2025).
5. Experimental Evaluation
CEP was evaluated using two state-of-the-art multi-table CE architectures:
- NeuroCard (autoregressive)
- FACE (normalizing flow)
across the IMDB (6 tables, JOB-light workload) and TPC-H (4 tables) datasets. Baselines included no-adaptation (Stale), full Retrain, and light Fine-Tune. Deletion scenarios encompassed both attribute-targeted (A-$k$) and random deletions, parameterized by the number of affected tables $k$ and the deletion ratio.
Performance was assessed using Q-error percentiles (50th through 99th) on both original (OQ) and complement (CQ) queries.
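For reference, the Q-error metric itself can be computed as follows; the arrays here are made-up values for illustration, not results from the paper.

```python
import numpy as np

def q_error(estimates: np.ndarray, true_cards: np.ndarray) -> np.ndarray:
    """Q-error = max(est/true, true/est) per query; 1.0 is a perfect estimate."""
    est = np.maximum(estimates, 1.0)      # guard against zero cardinalities
    true = np.maximum(true_cards, 1.0)
    return np.maximum(est / true, true / est)

# Percentile summary of the kind reported in the evaluation (50th-99th)
errs = q_error(np.array([120.0, 8.0, 3000.0]), np.array([100.0, 10.0, 50.0]))
print(np.percentile(errs, [50, 95, 99]))
```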
Key findings:
| Condition | Q-Error (CEP) | Q-Error (Retrain) | Q-Error (Fine-Tune) |
|---|---|---|---|
| JOB-light (A-1, OQ 50th) | 1.21 | 1.43 | Failure (Q99 ≈ 5142) |
| NeuroCard (A-6, Q99) | 4.84 | 21.84 | 4168 |
| FACE (A-6, Q99) | 24.70 | 41,100 | -- |
CEP achieved lower or comparable Q-error relative to full retraining, especially at high deletion ratios, and converged in fewer epochs. Pruning required only 0.3%–2.5% of the fine-tuning cost.
6. Ablations, Insights, and Limitations
Ablation studies revealed that:
- Domain Pruning alone (CEP-D) yields large tail errors (e.g., NeuroCard Q99≈2155), confirming that support removal without parameter pruning is insufficient.
- Sensitivity Pruning alone (CEP-S) offers moderate improvements but cannot prevent overestimation when domains disappear.
- Only the full combination achieves minimum Q-error across regimes. Incorporating Domain Pruning into baselines (FT+D, Retrain+D) substantially reduces high quantile errors, indicating its critical role.
CEP currently addresses deletions only; extending it to insertions or updates would require new sensitivity metrics and potentially dynamic subspace pruning. Integration into full query optimizers and end-to-end workload evaluation remain open directions.
A plausible implication is that CEP’s sparsification effect, occasionally outstripping retraining, may be explained by mechanisms analogous to the lottery ticket hypothesis. CEP stands as the first targeted unlearning solution tailored for distribution shifts in multi-table CE, establishing its utility for data deletion compliance and efficient model maintenance in database systems (He et al., 25 Nov 2025).