
SISA Training for Efficient Machine Unlearning

Updated 23 February 2026
  • SISA Training is a machine unlearning framework that partitions data into shards and slices for precise, localized model updates upon data deletion.
  • It employs checkpointing and aggregation strategies to enable exact unlearning with enhanced computational efficiency compared to retraining from scratch.
  • Variants like SISA-FC, SISA-A, and SISA++ adapt the approach for different domains, balancing memory usage, speed, and accuracy trade-offs.

SISA training refers to the Sharded, Isolated, Sliced, and Aggregated framework for efficient machine unlearning—enabling the precise and computationally tractable removal of individual data points from stateful models, especially those trained via stochastic gradient descent. SISA explicitly structures the training process so that the influence of any training example is confined to a limited, highly local segment of model state. SISA and its variants provide unlearning guarantees and superior efficiency relative to naïve retraining in diverse data domains including vision, speech, natural language, and multimodal mobility data (Bourtoule et al., 2019, Kumar et al., 2022, Phukan et al., 2 Jun 2025, Yonekura et al., 27 Aug 2025).

1. Foundations and Motivation

Machine unlearning is mandated by privacy statutes such as the GDPR’s “right to be forgotten,” which require that, upon a user's request, all traces of their data be removed not only from storage but also from any trained models. Traditional deep learning pipelines are fundamentally stateful: every gradient step is globally entangled with prior data. Consequently, unlearning by retraining from scratch is computationally prohibitive at real-world scale, especially for models with $|D| = 10^8$ and frequent deletion requests (Bourtoule et al., 2019).

SISA’s architectural innovation injects retraining-locality by partitioning the dataset into several shards, training each submodel in isolation, and implementing staged slicing within each shard to facilitate fine-grained, checkpointed state. Aggregation at inference combines the constituent models, retaining predictive performance while confining data influence.

2. Mathematical Formulation and Core Algorithms

Given a dataset $D = \{(x_i, y_i)\}_{i=1}^N$, choose

  • $S$: the number of shards
  • $R$: the number of slices per shard

Partition $D$ into $S$ disjoint shards $D_1, \ldots, D_S$, each of size $N/S$. Within each shard $D_k$, partition further into $R$ slices $D_{k,1}, \ldots, D_{k,R}$ with $|D_{k,r}| \approx N/(S R)$ (Bourtoule et al., 2019, Kumar et al., 2022, Yonekura et al., 27 Aug 2025).
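The partitioning step can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function names `partition` and `build_index`, the random shuffle, and the strided split are assumptions chosen for brevity. The lookup table is included because deletion later requires finding the shard and slice that hold a given record.

```python
import random

def partition(n_samples, n_shards, n_slices, seed=0):
    """Return layout[k][r] = list of sample indices in slice r of shard k."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding yields S disjoint shards whose sizes differ by at most one.
    shards = [indices[k::n_shards] for k in range(n_shards)]
    # The same striding splits each shard into R disjoint slices.
    return [[shard[r::n_slices] for r in range(n_slices)] for shard in shards]

def build_index(layout):
    """Map each sample index to its (shard, slice) pair for later deletions."""
    return {i: (k, r)
            for k, shard in enumerate(layout)
            for r, slice_ in enumerate(shard)
            for i in slice_}

layout = partition(n_samples=1000, n_shards=5, n_slices=8)
location = build_index(layout)
```

With $N = 1000$, $S = 5$, and $R = 8$, every slice holds $N/(SR) = 25$ indices, and `location[i]` gives the `(k, r)` pair for sample `i`.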

Training

For each shard $k$:

  • Initialize model parameters $\theta_k^{(0)}$.
  • For $r = 1, \dots, R$:
    • Train on slices $D_{k,1}, \ldots, D_{k,r}$; checkpoint $\theta_k^{(r)}$.
  • The final constituent model is $M_k = \theta_k^{(R)}$.

The per-shard empirical risk is

$$L_k(\theta_k) = \frac{1}{|D_k|} \sum_{(x,y) \in D_k} \ell(M_k(x;\theta_k), y)$$

where $\ell(\cdot, \cdot)$ is the appropriate loss.
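The per-shard loop above can be sketched as follows. To stay self-contained, the sketch assumes a toy one-dimensional linear model trained by plain SGD with squared loss; the helper names (`sgd_epoch`, `train_shard`, `empirical_risk`) and the hyperparameters are illustrative and not taken from the cited papers.

```python
import copy

def sgd_epoch(theta, data, lr=0.01):
    """One SGD pass for a 1-D linear model y ~ w*x + b with squared loss."""
    w, b = theta
    for x, y in data:
        err = (w * x + b) - y
        w, b = w - lr * err * x, b - lr * err
    return [w, b]

def train_shard(slices, epochs_per_stage=5, lr=0.01):
    """Train one shard incrementally; checkpoints[r] plays the role of theta_k^(r)."""
    theta = [0.0, 0.0]                       # theta_k^(0)
    checkpoints = [copy.copy(theta)]
    seen = []
    for slice_data in slices:                # stage r = 1..R
        seen = seen + slice_data             # D_{k,1} through D_{k,r}
        for _ in range(epochs_per_stage):
            theta = sgd_epoch(theta, seen, lr)
        checkpoints.append(copy.copy(theta)) # saved state after stage r
    return checkpoints

def empirical_risk(theta, data):
    """Mean squared loss of the model over one shard (L_k above)."""
    w, b = theta
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)

# Toy shard: 4 slices of 5 points drawn from y = 2x + 1.
points = [(i / 20, 2 * i / 20 + 1) for i in range(20)]
slices = [points[r::4] for r in range(4)]
checkpoints = train_shard(slices)
final_theta = checkpoints[-1]                # the constituent model M_k
```

The list holds $R + 1$ entries, the initial state plus one checkpoint per stage, which is what makes the later rollback to $\theta_k^{(r-1)}$ possible.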

Inference

Aggregate constituent models’ predictions:

  • Classification: Majority vote or uniform average of softmax probabilities.
  • Regression: Uniform mean over constituent outputs:

$$\hat{y} = \frac{1}{S} \sum_{k=1}^S M_k(x; \theta_k)$$
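The aggregation rules can be illustrated with a short sketch. The function names are hypothetical, and the per-shard probability vectors are made-up numbers chosen to show that majority voting and probability averaging need not agree.

```python
from collections import Counter

def majority_vote(labels):
    """Most common predicted label; ties go to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]

def average_probabilities(prob_vectors):
    """Uniform mean of the per-shard class-probability vectors."""
    n = len(prob_vectors)
    return [sum(column) / n for column in zip(*prob_vectors)]

def regression_mean(outputs):
    """Uniform mean over constituent regression outputs."""
    return sum(outputs) / len(outputs)

def argmax(vector):
    return max(range(len(vector)), key=vector.__getitem__)

# Softmax outputs of three constituent models for one input (illustrative values).
shard_probs = [[0.9, 0.1], [0.45, 0.55], [0.45, 0.55]]

vote_label = majority_vote([argmax(p) for p in shard_probs])
mean_probs = average_probabilities(shard_probs)
mean_label = argmax(mean_probs)
```

Here two of three shards vote for class 1, while the averaged probabilities favor class 0 because the first shard is far more confident. Which rule works better is an empirical question for the task at hand.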

Unlearning

When a record $(x^*, y^*)$ is to be deleted:

  • Identify shard and slice $(k^*, r^*)$ containing $x^*$.
  • Reload $\theta_{k^*}^{(r^*-1)}$.
  • Remove $x^*$ from $D_{k^*,r^*}$.
  • Retrain from slice $r^*$ through $R$ (on $k^*$ only), replaying with $x^*$ excluded, and replace $M_{k^*}$.
  • All other shards/models remain unchanged (Bourtoule et al., 2019, Yonekura et al., 27 Aug 2025, Kumar et al., 2022).

This procedure achieves exact unlearning: the resulting model state matches that which would have been obtained had $x^*$ never been in the training set.
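The rollback-and-replay procedure can be checked on a toy shard. The sketch below assumes a deterministic one-dimensional linear model trained by SGD, so that "exact" can be tested as bitwise equality; with stochastic training the guarantee is instead a statement about the distribution of models. All function names are illustrative, and slice indices are 0-based.

```python
def sgd_epoch(theta, data, lr=0.01):
    """One deterministic SGD pass for y ~ w*x + b with squared loss."""
    w, b = theta
    for x, y in data:
        err = (w * x + b) - y
        w, b = w - lr * err * x, b - lr * err
    return (w, b)

def run_stages(theta, slices, start, epochs=3):
    """Run stages start..R-1 (0-based) and return one checkpoint per stage."""
    out = []
    for r in range(start, len(slices)):
        seen = [p for s in slices[: r + 1] for p in s]   # cumulative slices
        for _ in range(epochs):
            theta = sgd_epoch(theta, seen)
        out.append(theta)
    return out

def unlearn(slices, checkpoints, r_star, record):
    """Delete `record` from slice r_star and replay the later stages.

    checkpoints[0] is the initial state; checkpoints[r + 1] follows stage r.
    """
    slices[r_star].remove(record)
    theta = checkpoints[r_star]            # state saved before stage r_star
    checkpoints[r_star + 1:] = run_stages(theta, slices, r_star)
    return checkpoints[-1]

# Toy shard: 4 slices of 5 points drawn from y = 2x + 1.
points = [(i / 20, 2 * i / 20 + 1) for i in range(20)]
slices = [points[r::4] for r in range(4)]
init = (0.0, 0.0)
checkpoints = [init] + run_stages(init, slices, 0)

record = slices[2][1]
unlearned = unlearn(slices, checkpoints, 2, record)
scratch = run_stages(init, slices, 0)[-1]  # same shard retrained without the record
```

Only stages 2 and 3 are replayed, yet `unlearned` equals `scratch`, because the checkpoint taken before stage 2 never depended on the deleted record.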

3. SISA Variants and Extensions

Several SISA-based and inspired strategies exist for optimization under storage, memory, and performance constraints:

  • SISA-FC (SISA with fully-connected head): Stores only FC-head parameters per checkpoint, yielding up to 95% memory reduction, but may incur substantial accuracy loss (20–30 points on GLUE tasks) (Kumar et al., 2022).
  • SISA-A (SISA with adapters): Uses lightweight trainable adapters within a frozen backbone, balancing memory savings and retaining accuracy within 1–6 points (Kumar et al., 2022).
  • MobText-SISA: Employs similarity-aware sharding on latent embeddings for complex spatio-temporal and text mobility logs, using GMM clustering and round-robin assignment to preserve inter-shard diversity and ensure efficient rollback (Yonekura et al., 27 Aug 2025).
  • SISA++: Advances SISA by replacing output-ensemble aggregation with weight-averaging of constituent model parameters at inference, empirically reducing post-unlearning performance drop by landing in a “flat” low-loss region of parameter space (Phukan et al., 2 Jun 2025).
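The weight-averaging idea behind SISA++ can be sketched as an element-wise mean over constituent parameters. This is only an assumed, simplified form: it presumes all constituents share one architecture, uses uniform weights, and represents parameters as plain dictionaries of lists. The function name `average_weights` and the numbers are hypothetical, and the cited paper may differ in its details.

```python
def average_weights(state_dicts):
    """Uniform element-wise mean of parameter dicts with identical keys and shapes."""
    n = len(state_dicts)
    return {
        name: [sum(values) / n
               for values in zip(*(sd[name] for sd in state_dicts))]
        for name in state_dicts[0]
    }

# Parameters of three constituent models with the same layout (illustrative values).
constituents = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [2.0, 4.0], "b": [0.5]},
    {"w": [3.0, 0.0], "b": [1.0]},
]
merged = average_weights(constituents)
```

Unlike output ensembling, inference then needs a single forward pass through `merged`. For non-linear models the merged network is not equivalent to the averaged outputs, which is why the benefit is described above as empirical.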

4. Empirical Results and Performance Analysis

Multiple empirical studies demonstrate SISA’s efficacy:

| Data Domain | Task | Shards × Slices | Model | Speed-up | ΔAccuracy (unlearning) | Reference |
|---|---|---|---|---|---|---|
| Ecommerce/CV | Classification | 20 × 50 | MLP, Wide ResNet, ResNet-50 | 4.6× | –1.3 to –19.5 pp | (Bourtoule et al., 2019) |
| NLP (GLUE SST, QQP, MNLI) | Classification | 5 × 16 | BERT+Adapter, BERT+FC | 10–100× | –1 to –6 pp (Adapter); –20 to –30 pp (FC head) | (Kumar et al., 2022) |
| Mobility | Multimodal cls | 2–16 × 8 | MLP over numerical+BERT embeddings | | +5–10% RMSE if random sharding vs similarity sharding | (Yonekura et al., 27 Aug 2025) |
| Speech (CREMA-D, E-DAIC) | SER/DD | 4–8 (K) | TRILLsson+Transformer, x-vector, XLS-R, WavLM | | ΔAcc –2.3 (SISA++), –5.3 (SISA) | (Phukan et al., 2 Jun 2025) |

Reported unlearning speed-ups over retraining from scratch range from 1.3–4.6× up to the 10–100× listed above for the NLP setting (Kumar et al., 2022), and they grow as $S$ and $R$ increase. The accompanying drops in accuracy are often modest and partly recoverable by proper aggregation and downstream model selection, although the table also shows losses of roughly 20 pp or more in the least favorable configurations. In speech and NLP tasks, SISA-A and SISA++ further reduce degradation post-unlearning while preserving computational efficiency (Kumar et al., 2022, Phukan et al., 2 Jun 2025).
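A back-of-envelope cost model helps explain why larger $S$ and $R$ reduce unlearning cost. The sketch below assumes a single deletion request that is uniformly distributed over shards and slices, and assumes that stage $r$ costs $r$ units because it trains on $r$ cumulative slices for a fixed number of epochs. Under those assumptions the expected replay within the affected shard is $\frac{2}{3} + \frac{1}{3R}$ of that shard's training cost, and only one of the $S$ shards is touched. Measured speed-ups depend on many factors this model ignores.

```python
from fractions import Fraction

def expected_retrain_fraction(n_shards, n_slices):
    """Expected share of total SISA training cost replayed for one deletion.

    Assumes a uniformly random deletion and that stage r costs r units.
    """
    R = n_slices
    full_shard = sum(range(1, R + 1))                      # R(R+1)/2
    # A deletion in slice r* replays stages r*..R of the affected shard.
    replay = [sum(range(r, R + 1)) for r in range(1, R + 1)]
    within_shard = Fraction(sum(replay), R * full_shard)   # 2/3 + 1/(3R)
    return within_shard / n_shards                         # one shard of S

fraction = expected_retrain_fraction(n_shards=5, n_slices=8)
```

For $S = 5$ and $R = 8$ this gives $17/120$, so roughly 14% of the original training work is expected to be replayed per deletion under the stated assumptions.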

5. Practical Guidelines and Trade-Offs

Best-practice recommendations include:

  • Shard/slice allocation: Increasing $S$ and $R$ (smaller shards/slices) accelerates unlearning at the cost of possible accuracy loss. Empirically, $S = 5$ and $R = 8$–$16$ balance speedup and performance in NLP; for speech, $8 \leq K \leq 16$ is optimal (Kumar et al., 2022, Phukan et al., 2 Jun 2025).
  • Feature representations: Use robust high-level embeddings (e.g., TRILLsson for SER/DD) for post-unlearning stability; avoid purely hand-crafted features (e.g. MFCC) when possible (Phukan et al., 2 Jun 2025).
  • Downstream architectures: Small Transformer encoders preserve global structure more stably than CNN or SVM for SISA-based post-unlearning scenarios (Phukan et al., 2 Jun 2025).
  • Checkpoint management: Storing $S \cdot R$ model or adapter checkpoints is feasible given their small size. Adapter-based or head-only checkpointing strongly reduces memory demand (Kumar et al., 2022).
  • Aggregation at inference: Averaging softmax probabilities across shard/slice models recovers accuracy lost to small shard sizes. Weight averaging (SISA++) further stabilizes performance post-unlearning (Phukan et al., 2 Jun 2025).
  • Distribution-aware partitioning: Probabilistically cluster high-deletion-probability points into small shards (“distribution-aware sharding”) to further concentrate retraining effort and reduce expected cost (Bourtoule et al., 2019, Yonekura et al., 27 Aug 2025).

6. Applications and Domain-Specific Adaptations

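SISA has been applied and/or extended in:

  • Vision and e-commerce classification (Bourtoule et al., 2019)
  • NLP classification on GLUE tasks, via SISA-FC and SISA-A (Kumar et al., 2022)
  • Multimodal mobility data, via MobText-SISA (Yonekura et al., 27 Aug 2025)
  • Speech tasks (SER/DD on CREMA-D and E-DAIC), via SISA++ (Phukan et al., 2 Jun 2025)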

Each domain presents task-specific tradeoffs among checkpoint size, aggregation strategy, and feature robustness. The ability to apply SISA in a model-agnostic fashion (via checkpointing and consensus aggregation) underpins its extensibility.

7. Limitations and Theoretical Guarantees

SISA achieves provable exact unlearning: after removing $x^*$, the model distribution matches the counterfactual in which $x^*$ was never part of training, conditional on independence among shards and strict isolation during training and unlearning (Bourtoule et al., 2019, Yonekura et al., 27 Aug 2025). However, very small shards may induce model overfitting or excessive statistical inefficiency. Excessively granular checkpointing inflates disk storage, though this is mitigated by using lightweight adapters or FC-heads. Empirically, most performance loss from SISA arises when per-shard data is heavily subsampled relative to model capacity (Kumar et al., 2022), but this can be controlled by task-specific hyperparameter tuning and proper inference aggregation.

In summary, SISA training, together with its adapter-based and weight-averaging variants, is a prominent methodological framework for exact machine unlearning at scale across diverse modern ML domains, combining an exactness guarantee with practical deployment efficiency (Bourtoule et al., 2019, Kumar et al., 2022, Phukan et al., 2 Jun 2025, Yonekura et al., 27 Aug 2025).
