MEraser: Semantic Data Erasure in Databases
- MEraser is a framework for provable data erasure in relational databases, rigorously preventing inference of deleted data via semantic dependencies.
- It employs demand-driven and batching algorithms to compute minimal deletion sets, effectively balancing computational cost with latency.
- Empirical evaluations show significant improvements, including up to 90% reduction in recomputation and deletion sets that are 10–100× smaller than traditional methods.
MEraser refers to a set of systems and algorithms—most notably, the one introduced in "Meaningful Data Erasure in the Presence of Dependencies"—which enforce formal, provable guarantees of data deletion in relational databases, specifically accounting for semantic dependencies such as foreign keys, functional dependencies, and more general relational dependency rules. The central novelty is a rigorous definition of erasure semantics such that, after erasing a data item, no stronger inference can be made about the erased value from the remaining database plus any background dependencies than could have been made before its insertion. This ensures regulatory compliance (e.g., GDPR), even when direct and indirect dependencies could otherwise allow inference of erased data via residual database state (Chakraborty et al., 1 Jul 2025).
1. Formal Foundations: Pre-insertion Post-Erasure Equivalence (P2E2) and Relational Dependency Rules
MEraser introduces the Pre-insertion Post-Erasure Equivalence (P2E2) guarantee. Consider a cell with creation and expiration timestamps, together with a set of background semantic dependencies expressed as Relational Dependency Rules (RDRs), each conditioned by an SQL-style predicate.
A grounded (instantiated) RDR binds a rule to concrete cells of the database.
Given a database state, the dependency set of a cell includes all instantiated RDRs involving that cell, as well as those reachable via attribute-chaining.
P2E2 requires that, after erasure, the inferences that the dependencies permit about the cell do not exceed those available before the cell's insertion.
The exact minimization problem (Opt-P2E2), which asks for the smallest set of cells whose erasure guarantees P2E2 for the target cell, is NP-hard (Chakraborty et al., 1 Jul 2025).
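To make the inference risk concrete, here is a minimal sketch in Python. It is a toy model, not the paper's formalism: the table, the functional dependency zip -> city, and the helper `infer_via_fd` are all hypothetical and chosen only for illustration. It shows that nulling out a single cell can leave its value recoverable from the remaining rows, which is the situation P2E2 is meant to rule out.

```python
from typing import Optional

# Toy table: each row maps attribute name -> value (None marks an erased cell).
rows = [
    {"id": 1, "zip": "92697", "city": "Irvine"},
    {"id": 2, "zip": "92697", "city": "Irvine"},
    {"id": 3, "zip": "10001", "city": "New York"},
]

def infer_via_fd(rows, target_id, lhs, rhs) -> Optional[str]:
    """Try to recover the erased `rhs` cell of row `target_id` using the
    (assumed) functional dependency lhs -> rhs and the other rows."""
    target = next(r for r in rows if r["id"] == target_id)
    if target[lhs] is None:
        return None  # nothing left to join on
    for r in rows:
        if r["id"] != target_id and r[lhs] == target[lhs] and r[rhs] is not None:
            return r[rhs]  # another row with the same lhs reveals the value
    return None

# Naive erasure: only the requested cell is removed.
rows[0]["city"] = None
leaked = infer_via_fd(rows, 1, "zip", "city")

# Additionally erasing the cell that the dependency joins on blocks the inference.
rows[0]["zip"] = None
safe = infer_via_fd(rows, 1, "zip", "city")
```

In this toy, `leaked` recovers the erased city while `safe` does not. Erasing the zip cell is only one way to block the inference; choosing the cheapest such set of extra deletions is the optimization problem described above.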
2. Architecture and Core Algorithms
MEraser implements erasure via two main operational classes:
A. Demand-driven Erasure: Upon a deletion request, the procedure is:
- Dependency instantiation: A BFS over the attribute-dependency graph identifies all instantiated RDRs (Algorithm 1).
- Minimal deletion computation:
  - If the dependency graph is acyclic:
    - Hypergraph dynamic programming algorithm: bottom-up cost propagation, then top-down tracing to recover the minimal set (optimal in aligned cases).
  - For cyclic or high-arity cases:
    - ILP reduction: encodes RDR-cell constraints as a 0-1 ILP, minimizing a user-supplied cost.
    - Approximation (greedy) algorithm: one-pass and non-exhaustive, with an approximation guarantee in uniform-cost, acyclic scenarios.
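The dependency-instantiation step can be sketched as a breadth-first search. The following is an illustrative simplification, not the paper's Algorithm 1: it assumes each grounded RDR is already available as a tuple of cell identifiers (the names `reachable_rules` and the cell labels are hypothetical), and it collects every rule reachable from the target cell by chaining through shared cells.

```python
from collections import deque

def reachable_rules(target, rules):
    """Return indices of grounded rules reachable from `target`.

    `rules` is a list of tuples of cell identifiers; two rules are chained
    when they share a cell (a stand-in for attribute-chaining).
    """
    # Index: cell -> rules that mention it.
    cell_to_rules = {}
    for idx, cells in enumerate(rules):
        for cell in cells:
            cell_to_rules.setdefault(cell, []).append(idx)

    seen_cells = {target}
    seen_rules = set()
    order = []               # rules in discovery order
    queue = deque([target])
    while queue:
        cell = queue.popleft()
        for idx in cell_to_rules.get(cell, []):
            if idx in seen_rules:
                continue
            seen_rules.add(idx)
            order.append(idx)
            # Chain onward through every other cell of this rule.
            for other in rules[idx]:
                if other not in seen_cells:
                    seen_cells.add(other)
                    queue.append(other)
    return order

rules = [
    ("t1.city", "t1.zip"),      # rule 0
    ("t1.zip", "t2.zip"),       # rule 1, chained to rule 0 via t1.zip
    ("t9.salary", "t9.tax"),    # rule 2, unrelated to t1
]
related = reachable_rules("t1.city", rules)
```

Only the rules returned here would need to be passed to the minimal-deletion solver; the unrelated rule 2 is never touched.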
B. Batching: A grace-period parameter lets the system accumulate deletion requests over a time window and optimize all targets in a single batch. Interdependencies among to-be-deleted cells are exploited to pre-mark affected RDRs “NULL,” amortizing instantiation and solver costs.
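A minimal sketch of grace-period batching follows. It assumes requests arrive as (timestamp, cell) pairs and that a batch is closed once the grace period has elapsed since its first request; the function name `batch_requests` and this windowing policy are illustrative assumptions, not details taken from the source.

```python
def batch_requests(requests, grace_period):
    """Group (timestamp, cell) deletion requests into batches.

    A new batch starts when a request arrives at least `grace_period`
    time units after the first request of the current batch.
    """
    batches = []
    current = []
    window_start = None
    for ts, cell in sorted(requests):
        if window_start is None or ts - window_start >= grace_period:
            if current:
                batches.append(current)
            current = []
            window_start = ts
        current.append(cell)
    if current:
        batches.append(current)
    return batches

requests = [(0, "a"), (1, "b"), (5, "c"), (6, "d"), (20, "e")]
batches = batch_requests(requests, grace_period=5)
```

With five requests grouped into three batches, the instantiation and solver steps would run three times instead of five. A longer grace period reduces that count further at the price of a longer wait before each cell is actually erased.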
Retention-driven Erasure addresses expiring data and derived cells through proactive batch erasure and optimal “interval cover” scheduling algorithms that minimize induced recomputation.
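The source does not spell out the interval-cover scheduler, so the sketch below uses the textbook greedy for covering intervals with the fewest points as a stand-in. It assumes each derived item must be recomputed at some time inside a window (start, end), and that one recomputation run serves every item whose window contains that time; both the modelling and the name `min_recompute_times` are assumptions.

```python
def min_recompute_times(windows):
    """Pick the fewest time points such that every (start, end) window
    contains at least one chosen point (classic greedy by right endpoint)."""
    points = []
    last = None
    for start, end in sorted(windows, key=lambda w: w[1]):
        if last is None or start > last:
            # Window not yet covered: schedule a run as late as allowed.
            last = end
            points.append(end)
    return points

windows = [(1, 4), (2, 6), (5, 8), (7, 9)]
schedule = min_recompute_times(windows)
```

Here four windows are served by two recomputation runs. Scheduling each run as late as its window allows is what lets a single run cover as many overlapping windows as possible.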
3. Performance and Cost Trade-Offs
MEraser's design includes a spectrum of operational trade-offs:
- Grace-period batching balances user latency against instantiation/solver amortization.
- Solver choice: Hypergraph DP achieves minimal cost deletion in acyclic graphs, but greedy approximation is preferable under high-arity, cyclic, or high-throughput demands.
- Scheduler for derived data: An “interval-cover” schedule for periodic recomputation enables up to 90% reduction in unnecessary recomputation events as dependency graphs evolve.
Batching and approximation methods enable practical deployment at scale without violating P2E2 semantics, with tunable parameters for cost/latency/throughput trade-off (Chakraborty et al., 1 Jul 2025).
4. Theoretical Guarantees and Complexity
The system guarantees:
- Correctness: If the algorithm returns a deletion set for a target cell, then P2E2 holds for that cell.
- Optimality: Exact methods (hypergraph DP in the acyclic case, ILP) compute a cost-minimal deletion set.
- NP-hardness: Opt-P2E2 is NP-hard via reduction from a covering/repair problem.
- Completeness: Only dependencies on cells created after the target cell's creation time can violate P2E2, so only such newly instantiated RDRs need tracing.
- Approximation: The greedy approximation always returns a feasible deletion set, and in uniform-cost, acyclic settings the size of that set is within a bounded factor of optimal.
This provides robust guarantees for database operators regarding the irreversibility and leak-resistance of erasure in the presence of arbitrary dependency structure.
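The gap between exact and greedy solvers can be illustrated on a deliberately simplified covering model. The sketch assumes each grounded rule is reduced to the set of cells from which the target could be inferred, so that deleting at least one cell per rule blocks it; it ignores cascading requirements on the deleted cells themselves, and the function names are hypothetical. The exact solver is a brute-force search rather than the hypergraph DP or ILP, and it is exponential in the number of cells.

```python
from itertools import combinations

def exact_min_deletion(rule_bodies, cost):
    """Brute-force the cheapest set of cells that intersects every rule body."""
    cells = sorted({c for body in rule_bodies for c in body})
    best, best_cost = None, float("inf")
    for k in range(len(cells) + 1):
        for subset in combinations(cells, k):
            chosen = set(subset)
            if all(chosen & body for body in rule_bodies):
                total = sum(cost[c] for c in chosen)
                if total < best_cost:
                    best, best_cost = chosen, total
    return best, best_cost

def greedy_deletion(rule_bodies, cost):
    """Repeatedly delete the cell that blocks the most remaining rules per unit cost."""
    remaining = [set(b) for b in rule_bodies]
    chosen = set()
    while remaining:
        candidates = sorted({c for b in remaining for c in b})
        pick = max(candidates,
                   key=lambda c: sum(c in b for b in remaining) / cost[c])
        chosen.add(pick)
        remaining = [b for b in remaining if pick not in b]
    return chosen, sum(cost[c] for c in chosen)

bodies = [{"a", "b"}, {"a", "c"}, {"b", "d"}, {"c", "d"}]
cost = {"a": 1, "b": 1, "c": 1, "d": 1}   # uniform cost
exact_set, exact_cost = exact_min_deletion(bodies, cost)
greedy_set, greedy_cost = greedy_deletion(bodies, cost)
```

The greedy result is always feasible in this model and can never cost less than the exact optimum. On larger inputs the brute-force search becomes impractical, which is the regime where an approximation is attractive.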
5. Empirical Evaluation and Operational Impact
Evaluation on both real (Twitter, HotCRP, SmartBench) and synthetic (Tax, TPC-H) datasets demonstrates:
- MEraser’s minimal deletion sets are 10–100× smaller than cascade or minimal-cover baselines.
- ILP solvers achieve minimal deletion but with 5–10× higher memory/runtime than the hypergraph DP (HGr) method.
- HGr nearly matches the greedy (Apx) algorithm in speed for low-arity settings, but Apx is preferable for high-arity or deep-chained RDRs, achieving a 2–5× speedup with only 5–20% deletion-cost overhead.
- Batching with grace periods of up to 6 hours reduces model runs by 50–90% and total runtime by 30–70%.
- The retention-driven scheduling framework reduces derived data recomputation by 20–90% depending on update and expiration dynamics (Chakraborty et al., 1 Jul 2025).
These empirical results establish the practicality of rigorous semantic erasure at scale and quantify the cost/benefit landscape versus legacy and ad hoc deletion procedures.
6. Significance and Implications
MEraser defines and achieves “meaningful erasure” in relational and semantically coupled databases—precisely bounding what can be inferred post-erasure, including via arbitrary direct and indirect dependency logic. This addresses long-standing ambiguities in regulatory definitions of erasure, providing formal, enforceable semantics that preclude both trivial leakage and needlessly destructive over-deletion.
Additionally, by integrating efficient dependency reasoning (instantiation, hypergraph/ILP/greedy algorithms) and automated batching/scheduling, MEraser supports scalable operational deployment across a breadth of real-world database workloads. The framework supplies a foundational tool for provable deletion and privacy compliance under modern data governance regimes, and establishes a technical reference for subsequent designs in rigorous data sanitization and semantic access control (Chakraborty et al., 1 Jul 2025).