Approximate Class Unlearning Methods
- Approximate class unlearning methods are frameworks designed to remove the influence of specific class-labeled data from machine learning models without retraining from scratch.
- They employ strategies such as gradient ascent, alternating optimization, and feature-space manipulation, with some frameworks explicitly addressing practical mismatches between reported class labels and the underlying concepts to be forgotten.
- Empirical results on datasets such as CIFAR-10 and Tiny-ImageNet show that these methods closely match retraining performance while minimizing computational cost and privacy risks.
Approximate class unlearning methods are algorithmic frameworks designed to efficiently remove the influence of specific class-labeled data from trained machine learning models, with the goal of making the resulting model behave as though it had been trained without the forgotten class. These methods avoid the computational expense of full retraining and address the practical constraints, theoretical subtleties, and privacy requirements associated with data and class erasure in deployed models. The field has evolved from simple fine-tuning and gradient ascent variants to frameworks specifically designed for class-concept misalignment, adversarial privacy leakage, and certified removal guarantees.
1. Problem Setting and Class–Concept Decoupling
Class-wise unlearning refers to the removal of all samples belonging to a particular class label from a model, such as “dog” in image classification. However, the “class” as defined in the model’s label space may not coincide with the underlying semantic or data-generating concept. This distinction gives rise to several mismatches:
- All-matched: Reported, model, and true concept labels coincide.
- Target-mismatch: The reported and model labels are coarser than the concept to be forgotten.
- Model-mismatch: The true target to forget is coarser than the model’s label space.
- Data-mismatch: Only a subportion of the true concept is reported and forgotten.
The data partitioning usually distinguishes (illustrated in the sketch after this list):
- $\mathcal{D}_f$: the identified-to-be-forgotten subset (the reported class)
- $\mathcal{D}_r$: the remaining data, divided into
  - concept-aligned but unidentified points, and
  - data truly retained (outside the target concept)
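As a concrete illustration of this partition, the sketch below splits a dataset according to reported labels and (when known) underlying concept identities. The `Sample` fields and the set names `D_f`, `D_r_concept`, and `D_r_true` are illustrative assumptions for exposition, not notation taken from the cited papers.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Sample:
    x: object                  # raw input, e.g. an image tensor
    reported_label: int        # label in the model's label space
    concept_id: Optional[int]  # underlying concept, if known (else None)

def partition_for_unlearning(
    data: List[Sample], forget_label: int, forget_concept: int
) -> Tuple[List[Sample], List[Sample], List[Sample]]:
    """Split `data` into the three subsets described above:
    D_f         -- samples whose *reported* label is the class to forget
    D_r_concept -- remaining samples that still belong to the target concept
                   (concept-aligned but unidentified)
    D_r_true    -- samples genuinely outside the target concept
    """
    D_f, D_r_concept, D_r_true = [], [], []
    for s in data:
        if s.reported_label == forget_label:
            D_f.append(s)
        elif s.concept_id == forget_concept:
            D_r_concept.append(s)
        else:
            D_r_true.append(s)
    return D_f, D_r_concept, D_r_true
```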
Robust class-unlearning must confront these mismatches and retain performance on unrelated classes and concepts, motivating frameworks such as Target-Aware Forgetting (TARF) (Zhu et al., 12 Jun 2024).
2. Theoretical Formulations and Optimization Frameworks
Approximate class unlearning is typically posed as a composite optimization problem of the form $\min_{\theta}\,\mathcal{L}_{\mathrm{forget}}(\theta;\mathcal{D}_f) + \lambda\,\mathcal{L}_{\mathrm{retain}}(\theta;\mathcal{D}_r)$, where $\mathcal{L}_{\mathrm{forget}}$ targets the removal (“forgetting”) of class knowledge, often via gradient ascent on the forgotten class, and $\mathcal{L}_{\mathrm{retain}}$ is a utility constraint to preserve performance on the retained set (Dine et al., 4 Nov 2025).
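A minimal sketch of one optimization step for this composite objective is shown below, assuming a PyTorch classifier and standard cross-entropy losses; the loss combination and the trade-off weight `lam` are illustrative and do not reproduce any single cited method exactly.

```python
import torch
import torch.nn.functional as F

def unlearning_step(model, optimizer, forget_batch, retain_batch, lam=1.0):
    """One step of the generic forget/retain objective: gradient ascent on the
    forget batch (negated cross-entropy) plus descent on the retain batch."""
    xf, yf = forget_batch
    xr, yr = retain_batch

    optimizer.zero_grad()
    loss_forget = -F.cross_entropy(model(xf), yf)   # ascent: push loss up on D_f
    loss_retain = F.cross_entropy(model(xr), yr)    # descent: preserve utility on D_r
    loss = loss_forget + lam * loss_retain
    loss.backward()
    optimizer.step()
    return loss_forget.item(), loss_retain.item()
```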
Several key theoretical principles have been introduced:
- First-order feasible updates: Updates are masked to ensure parameter changes are aligned with both the unlearning and utility gradients, often via an AND-mask constructed from their signs. Noise-aware versions use a focus vector proportional to the estimated probability of sign agreement (Dine et al., 4 Nov 2025); a schematic sketch follows this list.
- Alternating ascent/descent: TARF interleaves annealed gradient ascent on $\mathcal{D}_f$ with selective descent on “safe” points identified by their representation-level change, formalized via a consistency score (Zhu et al., 12 Jun 2024).
- Geometric considerations: Unlearning solution quality can be sensitive to the feature geometry of classes and the “representation gravity” of data points in embedding space (Zhu et al., 12 Jun 2024).
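The following sketch illustrates the sign-agreement (AND-mask) idea from the first bullet above, under simplifying assumptions: both gradients are oriented as descent directions for their respective objectives, and the masked sum is applied directly as the parameter update. This is a schematic reading of the principle, not the exact update rule of Dine et al.

```python
import torch
import torch.nn.functional as F

def per_param_grads(model, loss):
    """Per-parameter gradients of `loss`, without touching .grad buffers."""
    return torch.autograd.grad(loss, list(model.parameters()))

def and_masked_step(model, forget_batch, retain_batch, lr=1e-3):
    """Schematic first-order feasible update: keep only coordinates where the
    unlearning direction and the utility direction agree in sign (AND-mask),
    zeroing out conflicting coordinates before taking the step."""
    xf, yf = forget_batch
    xr, yr = retain_batch
    # Orient both gradients as descent directions for their objectives:
    # descending on -CE(forget) is ascent on the forgetting loss.
    g_unlearn = per_param_grads(model, -F.cross_entropy(model(xf), yf))
    g_utility = per_param_grads(model, F.cross_entropy(model(xr), yr))
    with torch.no_grad():
        for p, gu, gr in zip(model.parameters(), g_unlearn, g_utility):
            mask = (torch.sign(gu) == torch.sign(gr)).float()  # agreement mask
            p.add_(-lr * mask * (gu + gr))                      # masked combined step
```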
Empirical and theoretical guarantees focus on tightness to retrained models with respect to output distributions, membership inference attack (MIA) success rates, and possible parameter-space gaps.
3. Exemplary Algorithmic Frameworks
Several recent methods operationalize these theories:
| Framework | Core Mechanism | Key Features |
|---|---|---|
| TARF (Zhu et al., 12 Jun 2024) | Annealed ascent on $\mathcal{D}_f$ + selective descent | Robust to class/concept mismatches, representation-aware retaining |
| TRW (Ebrahimpour-Boroojeny, 7 Dec 2025) | Tilted target reweighting for class-unlearning | Closely matches retrain output probabilities, reduces MIA leakage |
| DELETE (Zhou et al., 31 Mar 2025) | Masked distillation with “dark knowledge” | Decomposes loss into forgetting/retaining, only needs access to $\mathcal{D}_f$ |
| OPC (Jung et al., 10 Jul 2025) | One-point feature contraction for $\mathcal{D}_f$ | Enforces deep feature forgetting, robust to inversion attacks |
| Orthogonal Soft Pruning (Gong et al., 24 Jun 2025) | Prunes class-specific channels | Near-instant unlearning, minimal accuracy loss, requires orthogonal pretraining |
| OUR (Xiao et al., 28 Jul 2025) | Dual-phase: orthogonal unlearning then replay | Removes deep residuals, closes “pseudo-convergence” attack surface |
Each framework may focus on output-only alignment (e.g., TRW, DELETE), representation/projection-level constraints (e.g., TARF, OPC, Orthogonal Soft Pruning, OUR), or combinations thereof.
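As one concrete example of an output-only mechanism, the sketch below shows masked-logit distillation in the spirit of DELETE: a frozen copy of the original model serves as teacher, its logit for the forgotten class is suppressed, and the student is distilled toward the renormalized distribution. The temperature, masking rule, and function names are assumptions made for exposition rather than the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def masked_distillation_loss(student_logits, teacher_logits, forget_class, T=2.0):
    """Distill the student toward the frozen teacher's output distribution with
    the forgotten class masked out: the teacher's remaining 'dark knowledge' is
    preserved while the forgotten class receives zero target probability."""
    masked = teacher_logits.clone()
    masked[:, forget_class] = float("-inf")          # drop the class, renormalize the rest
    teacher_probs = F.softmax(masked / T, dim=1)     # zero mass on the forgotten class
    student_log_probs = F.log_softmax(student_logits / T, dim=1)
    # Soft cross-entropy against the masked teacher (equals KL up to a constant).
    return -(teacher_probs * student_log_probs).sum(dim=1).mean() * (T * T)
```

In practice such a term is minimized on whatever inputs the unlearner can access, possibly alongside an explicit retaining loss, depending on the data-availability assumptions of the specific framework.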
4. Empirical Performance and Benchmarks
Approximate class unlearning methods are evaluated primarily on CIFAR-10, CIFAR-100, Tiny-ImageNet, SVHN, and increasingly on generative models (e.g., Diffusion, Stable Diffusion) (Zhu et al., 12 Jun 2024, Gong et al., 24 Jun 2025, Ebrahimpour-Boroojeny, 7 Dec 2025). Metrics typically include (see the evaluation sketch after this list):
- Unlearning accuracy (UA): model performance (accuracy, FID, IoU) on test data from the forgotten class (lower is better)
- Retaining accuracy (RA): model performance on the retained set (higher is better)
- Test accuracy (TA): generalization on the full (retained) test set
- Membership inference attack (MIA): ability to distinguish forgotten data (lower is better)
- Gap to retrain: difference in UA/RA/MIA from reference model retrained from scratch
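A minimal evaluation sketch for the accuracy-based metrics and the gap to retrain is given below (MIA auditing is treated separately in Section 5); the loader names and report structure are illustrative.

```python
import torch

@torch.no_grad()
def accuracy(model, loader):
    """Top-1 accuracy over a DataLoader of (inputs, labels) pairs."""
    correct, total = 0, 0
    for x, y in loader:
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / max(total, 1)

def unlearning_report(unlearned, retrained, forget_loader, retain_loader, test_loader):
    """UA/RA/TA for the unlearned model and a retrained reference, plus gaps."""
    report = {}
    for name, model in (("unlearned", unlearned), ("retrain", retrained)):
        report[name] = {
            "UA": accuracy(model, forget_loader),  # lower is better
            "RA": accuracy(model, retain_loader),  # higher is better
            "TA": accuracy(model, test_loader),    # generalization on retained test set
        }
    report["gap_to_retrain"] = {
        k: abs(report["unlearned"][k] - report["retrain"][k]) for k in ("UA", "RA", "TA")
    }
    return report
```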
Notable quantitative findings include:
- In all-matched settings, methods such as TARF, DELETE, and TRW attain RA and UA close to the retrained reference (a small gap to retrain) and suppress MIA to retrain levels (Zhu et al., 12 Jun 2024, Zhou et al., 31 Mar 2025, Ebrahimpour-Boroojeny, 7 Dec 2025).
- In mismatched scenarios (data, model, or target), TARF substantially reduces the average metric gap relative to naive gradient-ascent or fine-tuning baselines, which incur retention losses of $20$% or more (Zhu et al., 12 Jun 2024).
- Orthogonal Soft Pruning achieves complete forgetting (UA=0) with minimal loss in retained accuracy and sub-second unlearning time (Gong et al., 24 Jun 2025).
- OPC uniquely provides feature-space “deep” forgetting, preventing inversion and feature recovery attacks more effectively than logit-only methods (Jung et al., 10 Jul 2025).
- Advanced attacks such as Reminiscence Attack (ReA) (Xiao et al., 28 Jul 2025) demonstrate that shallow methods may leave recoverable residuals; frameworks such as OUR mitigate these effects by enforcing orthogonalization at multiple hidden layers.
- TRW reduces MIA-NN and U-LiRA leakage compared to standard unlearning objectives, with minimal computational overhead (Ebrahimpour-Boroojeny, 7 Dec 2025).
5. Privacy Guarantees and Limitations
- Privacy threat models: Most frameworks now consider both conventional output-based MIAs and adaptive attacks leveraging representation-level or parameter-difference signals (e.g., ReA, MIA-NN) (Xiao et al., 28 Jul 2025, Ebrahimpour-Boroojeny, 7 Dec 2025); a minimal output-based audit is sketched after this list.
- Residual leakage: Even logit-aligned or zero-accuracy-unlearned models may suffer privacy leakage from deep representations unless specific countermeasures are in place (e.g., feature contraction, orthogonal residual destruction) (Zhu et al., 12 Jun 2024, Jung et al., 10 Jul 2025, Xiao et al., 28 Jul 2025).
- No formal DP certificate: Most non-convex deep learning scenarios lack formal $(\varepsilon, \delta)$-style removal guarantees. Some methods, especially for linear/convex models, combine Newton-style updates with Gaussian noise to provide such certificates for specific settings (Mahadevan et al., 2021, Suriyakumar et al., 2022).
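For a basic audit of output-level leakage, a confidence-threshold membership inference check can be run against the unlearned model, as sketched below. This attack is far weaker than ReA, MIA-NN, or U-LiRA and is meant only to illustrate the auditing workflow; all names are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def true_label_confidences(model, loader):
    """Confidence the model assigns to each example's true label."""
    scores = []
    for x, y in loader:
        probs = F.softmax(model(x), dim=1)
        scores.append(probs.gather(1, y.unsqueeze(1)).squeeze(1))
    return torch.cat(scores)

def threshold_mia_auc(model, forget_loader, nonmember_loader):
    """AUC of a confidence-threshold attack separating forgotten (former member)
    points from never-seen points; values near 0.5 indicate little residual
    output-level leakage after unlearning."""
    pos = true_label_confidences(model, forget_loader)
    neg = true_label_confidences(model, nonmember_loader)
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)           # all member/non-member pairs
    return ((diff > 0).float().mean() + 0.5 * (diff == 0).float().mean()).item()
```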
Limitations include reliance on hyperparameter tuning (e.g., forgetting strength, selection thresholds, pruning ratios), incomplete feature disentanglement in highly entangled representations, and the inability to produce strict retraining-equivalent guarantees. The effectiveness can degrade if the feature geometry is highly entangled or if label/data/report/target domains are severely misaligned.
6. Practical Guidelines and Open Problems
- Tuning and deployment: Parameters such as unlearning strength, retaining selection thresholds, pruning ratios, and learning rates must be tuned. Several frameworks operate effectively in $10$–$20$ epochs or fewer and require only $\mathcal{D}_f$ and, optionally, a frozen pretrained model (Zhu et al., 12 Jun 2024, Zhou et al., 31 Mar 2025).
- Validation: Empirical verification via membership inference audits is essential. Monitoring UA, RA, and MIA under standard and adaptive attacks should accompany each deployment (Ebrahimpour-Boroojeny, 7 Dec 2025).
- Compatibility and efficiency: Some methods require orthogonality regularization at training (Orthogonal Soft Pruning), while others are applicable as plug-ins to existing architectures and optimizers (TRW, TARF, feasible masking) (Dine et al., 4 Nov 2025, Gong et al., 24 Jun 2025).
- Open research: Strong theoretical bounds on parameter proximity to retraining, certified privacy for deep non-convex models, unlearning without access to the retained dataset, efficient sub-sampling, and scalability to non-classification tasks (e.g., LLMs, diffusion models) remain active research areas (Zhu et al., 12 Jun 2024, Xiao et al., 28 Jul 2025).
7. Summary Table of Selected Approximate Class Unlearning Methods
| Method | Main Principle | Empirical Gap to Retrain | MIA Suppression | Special Strength |
|---|---|---|---|---|
| TARF | Annealed ascent/descent + rep. consistency | 1–3% | Yes | Handles label/domain mismatches |
| TRW | Tilted reweighting of logit targets | <1% | Yes | Reduces new NN/MIA attacks |
| DELETE | Masked distillation, “dark knowledge” retention | | Yes | No $\mathcal{D}_r$ needed; logit-based |
| OPC | Feature-space contraction (deep forgetting) | | Yes | Resists feature inversion/recovery |
| Soft Pruning | Orthogonalized, class-specific filter attenuation | ≈1% | Yes | Millisecond latency, low overhead |
| OUR | Orthogonalization + replay for deep residuals | | Yes | Eliminates pseudo-convergence attacks |
A plausible implication is that while approximate class unlearning methods now approach full retraining in fidelity and privacy retention, continued advances are needed for difficult concept/feature misalignments, certified removal in non-convex settings, and robust protection against adaptive residual attacks. Emerging hybrid approaches combining loss-space, representation-space, and parameter-space interventions offer the most robust privacy–utility tradeoffs currently observed.
References:
- “Decoupling the Class Label and the Target Concept in Machine Unlearning” (Zhu et al., 12 Jun 2024)
- “Improving Unlearning with Model Updates Probably Aligned with Gradients” (Dine et al., 4 Nov 2025)
- “Toward Reliable Machine Unlearning: Theory, Algorithms, and Evaluation” (Ebrahimpour-Boroojeny, 7 Dec 2025)
- “Decoupled Distillation to Erase: A General Unlearning Method for Any Class-centric Tasks” (Zhou et al., 31 Mar 2025)
- “OPC: One-Point-Contraction Unlearning Toward Deep Feature Forgetting” (Jung et al., 10 Jul 2025)
- “Orthogonal Soft Pruning for Efficient Class Unlearning” (Gong et al., 24 Jun 2025)
- “Reminiscence Attack on Residuals: Exploiting Approximate Machine Unlearning for Privacy” (Xiao et al., 28 Jul 2025)