SISA Training for Efficient Machine Unlearning

Updated 23 February 2026

SISA Training is a machine unlearning framework that partitions data into shards and slices for precise, localized model updates upon data deletion.
It employs checkpointing and aggregation strategies to enable exact unlearning with enhanced computational efficiency compared to retraining from scratch.
Variants like SISA-FC, SISA-A, and SISA++ adapt the approach for different domains, balancing memory usage, speed, and accuracy trade-offs.

SISA training refers to the Sharded, Isolated, Sliced, and Aggregated framework for efficient machine unlearning—enabling the precise and computationally tractable removal of individual data points from stateful models, especially those trained via stochastic gradient descent. SISA explicitly structures the training process so that the influence of any training example is confined to a limited, highly local segment of model state. SISA and its variants provide unlearning guarantees and superior efficiency relative to naïve retraining in diverse data domains including vision, speech, natural language, and multimodal mobility data (Bourtoule et al., 2019, Kumar et al., 2022, Phukan et al., 2 Jun 2025, Yonekura et al., 27 Aug 2025).

1. Foundations and Motivation

Machine unlearning is motivated by privacy statutes such as the GDPR’s “right to be forgotten,” which can require that, upon a user's request, their data be removed not only from storage but also from any models trained on it. Traditional deep learning pipelines are fundamentally stateful: every gradient step depends on parameters shaped by all earlier data, so the influence of a single example is entangled with the whole model. Consequently, unlearning by retraining from scratch is computationally prohibitive at real-world scale, especially for training sets of size $|D| \approx 10^8$ with frequent deletion requests (Bourtoule et al., 2019).

SISA’s architectural innovation injects retraining-locality by partitioning the dataset into several shards, training each submodel in isolation, and implementing staged slicing within each shard to facilitate fine-grained, checkpointed state. Aggregation at inference combines the constituent models, retaining predictive performance while confining data influence.

2. Mathematical Formulation and Core Algorithms

Given a dataset $D = \{(x_i, y_i)\}_{i=1}^N$, choose

$S$: the number of shards
$R$: the number of slices per shard

Partition $D$ into $S$ disjoint shards $D_1, \ldots, D_S$, each of size approximately $N/S$. Within each shard $D_k$, partition further into $R$ disjoint slices $D_{k,1}, \ldots, D_{k,R}$ with $|D_{k,r}| \approx N/(S R)$ (Bourtoule et al., 2019, Kumar et al., 2022, Yonekura et al., 27 Aug 2025).
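The partitioning step can be sketched in a few lines of Python. This is a minimal illustration, assuming uniform random assignment of records to shards and slices; the function names `partition_sisa` and `build_index` are hypothetical and do not come from the cited papers.

```python
import random


def partition_sisa(n_records, num_shards, num_slices, seed=0):
    """Return shards[k][r] = list of record indices in slice r of shard k."""
    indices = list(range(n_records))
    # Assumption: records are assigned uniformly at random.
    random.Random(seed).shuffle(indices)
    # Strided splits keep shard and slice sizes within one record of each other.
    shards = [indices[k::num_shards] for k in range(num_shards)]
    return [[shard[r::num_slices] for r in range(num_slices)] for shard in shards]


def build_index(shards):
    """Map each record index to its (shard, slice) pair for later deletion lookups."""
    return {
        i: (k, r)
        for k, shard in enumerate(shards)
        for r, slice_ in enumerate(shard)
        for i in slice_
    }


# N = 100 records, S = 4 shards, R = 5 slices per shard.
shards = partition_sisa(n_records=100, num_shards=4, num_slices=5)
location = build_index(shards)
```

Keeping the `location` index alongside the shards is a design choice that makes the later "identify shard and slice" step a constant-time lookup.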

Training

For each shard $k$:

Initialize model parameters $\theta_k^{(0)}$.
For $r = 1, \dots, R$:
- Starting from $\theta_k^{(r-1)}$, train on slices $D_{k,1}, \ldots, D_{k,r}$; checkpoint $\theta_k^{(r)}$.
The final constituent model $M_k$ uses the parameters $\theta_k^{(R)}$.

The per-shard empirical risk is

$L_k(\theta_k) = \frac{1}{|D_k|} \sum_{(x,y) \in D_k} \ell(M_k(x;\theta_k), y)$

where $\ell(\cdot, \cdot)$ is the appropriate loss.
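The incremental, checkpointed training loop for one shard can be illustrated with a toy model. The sketch below assumes a scalar linear model $y \approx \theta x$ with squared loss and plain per-example SGD; the helper names, the learning rate, and the epoch count are illustrative assumptions, not settings from the cited work.

```python
def sgd_pass(theta, examples, lr=0.01):
    """One per-example SGD pass on squared loss for the toy model y ~ theta * x."""
    for x, y in examples:
        theta -= lr * 2.0 * (theta * x - y) * x
    return theta


def train_shard(slices, theta0=0.0, epochs=5, lr=0.01):
    """Train incrementally over slices; return [theta^(0), ..., theta^(R)]."""
    checkpoints = [theta0]
    theta = theta0
    for r in range(1, len(slices) + 1):
        # Step r trains on the union of slices 1..r, starting from theta^(r-1).
        seen = [ex for slice_ in slices[:r] for ex in slice_]
        for _ in range(epochs):
            theta = sgd_pass(theta, seen, lr)
        checkpoints.append(theta)  # checkpoint theta^(r)
    return checkpoints


def shard_risk(theta, slices):
    """Per-shard empirical risk L_k with squared loss."""
    data = [ex for slice_ in slices for ex in slice_]
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)


# Toy shard generated by y = 3x, split into R = 3 slices.
slices = [[(1.0, 3.0), (2.0, 6.0)], [(0.5, 1.5), (1.5, 4.5)], [(3.0, 9.0)]]
ckpts = train_shard(slices)
```

Each stored checkpoint depends only on the slices seen up to that step, which is the property the unlearning procedure relies on.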

Inference

Aggregate constituent models’ predictions:

Classification: Majority vote or uniform average of softmax probabilities.
Regression: Uniform mean over constituent outputs:

$\hat{y} = \frac{1}{S} \sum_{k=1}^S M_k(x; \theta_k)$
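The three aggregation rules can be written directly from their definitions. The following is a small sketch using plain Python lists; the function names are hypothetical, and the tie-breaking behavior of the majority vote (first label encountered wins) is an assumption of this example rather than something the source specifies.

```python
from collections import Counter


def majority_vote(labels):
    """Most common predicted label; ties resolve to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]


def mean_probabilities(prob_vectors):
    """Uniformly average per-model softmax vectors; return (argmax, averaged vector)."""
    s = len(prob_vectors)
    avg = [sum(column) / s for column in zip(*prob_vectors)]
    return max(range(len(avg)), key=avg.__getitem__), avg


def mean_regression(outputs):
    """Uniform mean over constituent regression outputs."""
    return sum(outputs) / len(outputs)


# Three constituent models, two classes. Two models weakly prefer class 0,
# one strongly prefers class 1.
probs = [[0.60, 0.40], [0.55, 0.45], [0.10, 0.90]]
hard_labels = [max(range(2), key=p.__getitem__) for p in probs]

vote = majority_vote(hard_labels)          # counts hard labels only
soft_label, avg = mean_probabilities(probs)  # takes confidence into account
```

In this example the two classification rules disagree: the vote follows the two weak class-0 predictions, while probability averaging follows the single confident class-1 prediction.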

Unlearning

When a record $(x^*, y^*)$ is to be deleted:

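Identify the shard and slice $(k^*, r^*)$ containing $x^*$.
Reload the checkpoint $\theta_{k^*}^{(r^*-1)}$, which was saved before slice $r^*$ was first used.
Remove $x^*$ from $D_{k^*,r^*}$.
Retrain steps $r^*$ through $R$ on shard $k^*$ only, with $x^*$ excluded, overwrite the checkpoints $\theta_{k^*}^{(r^*)}, \ldots, \theta_{k^*}^{(R)}$, and replace $M_{k^*}$.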
All other shards/models remain unchanged (Bourtoule et al., 2019, Yonekura et al., 27 Aug 2025, Kumar et al., 2022).

This procedure achieves exact unlearning: the resulting model state matches that which would have been obtained had $x^*$ never been in the training set.
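The unlearning steps above can be checked on a toy shard. The sketch below assumes a deterministic scalar model $y \approx \theta x$ trained with per-example SGD in a fixed order, so that "exact" can be tested as bitwise equality with retraining the shard from scratch; with randomized training the match would hold in distribution instead. All names and hyperparameters are illustrative assumptions.

```python
def train_from(theta, slices, start, epochs=5, lr=0.01):
    """Run incremental steps start..R from theta; return {r: theta^(r)}."""
    checkpoints = {}
    for r in range(start, len(slices) + 1):
        seen = [ex for slice_ in slices[:r] for ex in slice_]  # slices 1..r
        for _ in range(epochs):
            for x, y in seen:  # fixed order keeps training deterministic
                theta -= lr * 2.0 * (theta * x - y) * x
        checkpoints[r] = theta
    return checkpoints


def unlearn(slices, checkpoints, record):
    """Delete `record` from one shard and replay only the affected steps."""
    # 1. Locate the slice r* holding the record (1-based, as in the text).
    r_star = next(r for r, sl in enumerate(slices, start=1) if record in sl)
    # 2. Remove the record from a copy of the shard's slices.
    pruned = [list(sl) for sl in slices]
    pruned[r_star - 1].remove(record)
    # 3. Reload theta^(r*-1), retrain steps r*..R, and overwrite later checkpoints.
    kept = {r: th for r, th in checkpoints.items() if r < r_star}
    kept.update(train_from(checkpoints[r_star - 1], pruned, start=r_star))
    return pruned, kept, r_star


THETA0 = 0.0
# Shard with R = 3 slices; (4.0, 2.0) is the record to be forgotten.
slices = [[(1.0, 3.0), (2.0, 6.0)], [(0.5, 1.5), (4.0, 2.0)], [(3.0, 9.0)]]
ckpts = {0: THETA0, **train_from(THETA0, slices, start=1)}

pruned, new_ckpts, r_star = unlearn(slices, ckpts, record=(4.0, 2.0))
scratch = train_from(THETA0, pruned, start=1)  # reference: retrain shard from scratch
```

Here the record sits in slice 2, so step 1 is never replayed: `new_ckpts[1]` is the original checkpoint, and only steps 2 and 3 are recomputed.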

3. SISA Variants and Extensions

Several SISA-based and inspired strategies exist for optimization under storage, memory, and performance constraints:

SISA-FC (SISA with Fully-Connected head): Stores only FC-head parameters per checkpoint, yielding up to 95% memory reduction, but may incur substantial accuracy loss (20–30 points on GLUE tasks) (Kumar et al., 2022).
SISA-A (SISA with Adapters): Uses lightweight trainable adapters within a frozen backbone, balancing memory savings and retaining accuracy within 1–6 points (Kumar et al., 2022).
MobText-SISA: Employs similarity-aware sharding on latent embeddings for complex spatio-temporal and text mobility logs, using GMM clustering and round-robin assignment to preserve inter-shard diversity and ensure efficient rollback (Yonekura et al., 27 Aug 2025).
SISA++: Advances SISA by replacing output-ensemble aggregation with weight-averaging of constituent model parameters at inference, empirically reducing post-unlearning performance drop by landing in a “flat” low-loss region of parameter space (Phukan et al., 2 Jun 2025).
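The weight-averaging idea behind SISA++ can be sketched as a uniform average of constituent parameters. This is only an illustration under stated assumptions: parameters are represented as flat lists keyed by name, all constituents share the same architecture, and the averaging weights are uniform; the cited paper's exact procedure may differ, and `average_weights` is a hypothetical helper.

```python
def average_weights(state_dicts):
    """Uniformly average same-named parameters (flat lists) across constituent models."""
    names = state_dicts[0].keys()
    if any(sd.keys() != names for sd in state_dicts):
        raise ValueError("constituent models must share parameter names")
    s = len(state_dicts)
    return {
        name: [sum(vals) / s for vals in zip(*(sd[name] for sd in state_dicts))]
        for name in names
    }


def linear_predict(params, x):
    """Toy linear model: w . x + b."""
    return sum(w * xi for w, xi in zip(params["w"], x)) + params["b"][0]


# Two constituent models with identical parameter layout.
shard_models = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
]
merged = average_weights(shard_models)

x = [0.5, -1.0]
output_ensemble = sum(linear_predict(m, x) for m in shard_models) / len(shard_models)
weight_averaged = linear_predict(merged, x)
```

For a purely linear model the two aggregation routes give the same prediction, as the example shows; for nonlinear networks they generally differ, which is where the choice between output ensembling and weight averaging matters. After an unlearning request, only the affected shard's entry in `shard_models` changes before the average is recomputed.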

4. Empirical Results and Performance Analysis

Multiple empirical studies demonstrate SISA’s efficacy:

Data Domain	Task	Shards × Slices	Model	Speed-up	ΔAccuracy (unlearning)	Reference
Ecommerce/CV	Classification	20 × 50	MLP, Wide ResNet, ResNet-50	4.6×	–1.3 to –19.5 pp	(Bourtoule et al., 2019)
NLP (GLUE SST, QQP, MNLI)	Classification	5 × 16	BERT+Adapter, BERT+FC	10–100×	–1 to –6 pp (Adapter); –20 to –30 pp (FC head)	(Kumar et al., 2022)
Mobility	Multimodal cls	2–16 × 8	MLP over numerical+BERT embeddings	–	+5–10% RMSE if random sharding vs similarity sharding	(Yonekura et al., 27 Aug 2025)
Speech (CREMA-D, E-DAIC)	SER/DD	4–8 (K)	TRILLsson+Transformer, x-vector, XLS-R, WavLM	–	ΔAcc –2.3 (SISA++), –5.3 (SISA)	(Phukan et al., 2 Jun 2025)

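Reported speed-ups in unlearning over retraining from scratch range from 1.3–4.6× in the original e-commerce and vision experiments to 10–100× with adapter or FC-head checkpoints in NLP, and grow as $S$ and $R$ increase. Accuracy drops are often modest and partly recoverable through proper aggregation and downstream model selection, although the table above shows losses of roughly 20–30 pp for the hardest vision tasks and for head-only checkpointing. In speech and NLP tasks, SISA-A and SISA++ further reduce degradation post-unlearning while preserving computational efficiency (Kumar et al., 2022, Phukan et al., 2 Jun 2025).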
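The dependence of unlearning cost on $S$ and $R$ can be made concrete with a simple counting model. The sketch below assumes a single deletion request that is equally likely to fall in any shard and slice, an equal number of epochs at every slicing step, and a baseline whose full-retraining cost equals $S$ times the cost of training one shard. It is an idealized calculation and is not meant to reproduce the measured speed-ups in the table, which depend on factors this model ignores.

```python
from fractions import Fraction


def expected_retrain_fraction(num_slices):
    """Expected share of one shard's training cost replayed for one deletion.

    Step r processes r slices, so a full shard costs 1 + 2 + ... + R slice-passes.
    A deletion in slice r* replays steps r*..R.
    """
    R = num_slices
    full = sum(range(1, R + 1))
    replay = [sum(range(r_star, R + 1)) for r_star in range(1, R + 1)]
    return Fraction(sum(replay), R * full)  # average over r*, relative to full cost


def idealized_speedup(num_shards, num_slices):
    """Full retraining cost divided by expected single-deletion cost (assumed model)."""
    return num_shards / expected_retrain_fraction(num_slices)


fraction_r50 = expected_retrain_fraction(50)
speedup_20x50 = idealized_speedup(20, 50)
```

Under these assumptions the fraction has the closed form $2/3 + 1/(3R)$: for $R = 50$ it equals $101/150$. Slicing therefore saves at most about a third of the per-shard cost, while sharding contributes the factor $S$.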

5. Practical Guidelines and Trade-Offs

Best-practice recommendations include:

Shard/slice allocation: Increasing $S$ and $R$ (smaller shards/slices) accelerates unlearning at the cost of possible accuracy loss. Empirically, $S = 5$ and $R = 8$–$16$ balance speedup and performance in NLP; for speech, $8 \leq K \leq 16$ is optimal (Kumar et al., 2022, Phukan et al., 2 Jun 2025).
Feature representations: Use robust high-level embeddings (e.g., TRILLsson for SER/DD) for post-unlearning stability; avoid purely hand-crafted features (e.g. MFCC) when possible (Phukan et al., 2 Jun 2025).
Downstream architectures: Small Transformer encoders preserve global structure more stably than CNN or SVM for SISA-based post-unlearning scenarios (Phukan et al., 2 Jun 2025).
Checkpoint management: Storing $S \cdot R$ model or adapter checkpoints is feasible given their small size. Adapter-based or head-only checkpointing strongly reduces memory demand (Kumar et al., 2022).
Aggregation at inference: Averaging softmax probabilities across shard/slice models recovers accuracy lost to small shard sizes. Weight averaging (SISA++) further stabilizes performance post-unlearning (Phukan et al., 2 Jun 2025).
Distribution-aware partitioning: Probabilistically cluster high-deletion-probability points into small shards (“distribution-aware sharding”) to further concentrate retraining effort and reduce expected cost (Bourtoule et al., 2019, Yonekura et al., 27 Aug 2025).
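The checkpoint-management guideline above can be made tangible with a back-of-the-envelope storage estimate. In the sketch below, the figure of roughly 110 million parameters for a BERT-base-sized backbone is a commonly quoted approximation, while the adapter size of one million parameters is a purely hypothetical value chosen for illustration; optimizer state and compression are ignored.

```python
def checkpoint_storage_gib(num_shards, num_slices, params_per_checkpoint,
                           bytes_per_param=4):
    """Storage for S * R checkpoints in GiB, assuming fixed-width parameters."""
    total_bytes = num_shards * num_slices * params_per_checkpoint * bytes_per_param
    return total_bytes / 2**30


S, R = 5, 16  # the NLP configuration listed in the results table

# Assumption: ~110M parameters for a BERT-base-sized model stored in 32-bit floats.
full_model_gib = checkpoint_storage_gib(S, R, params_per_checkpoint=110_000_000)

# Hypothetical adapter with 1M trainable parameters; the frozen backbone is stored once.
adapter_gib = checkpoint_storage_gib(S, R, params_per_checkpoint=1_000_000)

savings_ratio = full_model_gib / adapter_gib
```

Because the count of checkpoints is fixed at $S \cdot R$, total storage scales linearly with the number of parameters saved per checkpoint, which is why adapter-only or head-only checkpoints reduce the footprint so sharply.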

6. Applications and Domain-Specific Adaptations

SISA has been applied and/or extended in:

Computer Vision: Canonical benchmarks (e.g., MNIST, SVHN, ImageNet) (Bourtoule et al., 2019).
NLP: GLUE tasks (SST, QQP, MNLI), leveraging either FC-head or adapter checkpoints (Kumar et al., 2022).
Speech Processing: SER (CREMA-D), Depression Detection (E-DAIC), with distinct embedding and architecture choices (Phukan et al., 2 Jun 2025).
Mobility Analytics: Urban-scale GPS and text logs with similarity-based sharding (Yonekura et al., 27 Aug 2025).

Each domain presents task-specific tradeoffs among checkpoint size, aggregation strategy, and feature robustness. The ability to apply SISA in a model-agnostic fashion (via checkpointing and consensus aggregation) underpins its extensibility.

7. Limitations and Theoretical Guarantees

SISA achieves provable exact unlearning: after removing $x^*$, the model distribution matches the counterfactual in which $x^*$ was never part of training, conditional on independence among shards and strict isolation during training and unlearning (Bourtoule et al., 2019, Yonekura et al., 27 Aug 2025). However, very small shards may induce model overfitting or excessive statistical inefficiency. Excessively granular checkpointing inflates disk storage, though this is mitigated by using lightweight adapters or FC-heads. Empirically, most performance loss from SISA arises when per-shard data is heavily subsampled relative to model capacity (Kumar et al., 2022), but this can be controlled by task-specific hyperparameter tuning and proper inference aggregation.

In summary, SISA training—alongside its adapter- and weight-averaging-based variants—constitutes the dominant methodological framework for machine unlearning at scale across diverse modern ML domains, providing both operational guarantees and practical deployment efficiency (Bourtoule et al., 2019, Kumar et al., 2022, Phukan et al., 2 Jun 2025, Yonekura et al., 27 Aug 2025).

References (4)

Machine Unlearning (2019)

Privacy Adhering Machine Un-learning in NLP (2022)

Towards Machine Unlearning for Paralinguistic Speech Processing (2025)

MobText-SISA: Efficient Machine Unlearning for Mobility Logs with Spatio-Temporal and Natural-Language Data (2025)