
Attribute-Level Unlearning in ML Models

Updated 6 January 2026
  • Attribute-level unlearning is the selective removal of specific attribute information (e.g., demographics or factual data) from pre-trained models to render such attributes undetectable.
  • Methodologies include subspace extraction, adversarial techniques, and information-theoretic losses to minimize mutual information between model representations and unwanted attributes.
  • Empirical evaluations indicate strong forgetting efficacy with minimal impact on main-task performance, balanced through tailored regularization and modular adjustments.

Attribute-level unlearning is the systematic process of excising specific attributes—such as demographic variables, factual associations, or proprietary knowledge—from a pre-trained machine learning model, with the goal of rendering the presence or influence of these attributes undetectable in either the model’s output or internal representations, while preserving the utility for all other attributes and tasks. This paradigm extends beyond sample-wise unlearning, focusing on the removal of structured, cross-sample information. A large literature investigates attribute-level unlearning across representation learning, LLMs, and recommender systems, employing subspace, adversarial, distributional, and information-theoretic methods.

1. Problem Definition and Theoretical Foundations

Attribute-level unlearning seeks to selectively remove the information content pertaining to one or more attributes $z$ from model parameters or latent representations $h = f_w(x)$, while maintaining performance on primary tasks associated with labels $y$ (Guo et al., 2022). The key desiderata are:

  • Selective attribute removal: Ensuring mutual information $I(h;z)$ is minimized.
  • Task fidelity: Preserving $I(h;y)$ at a high level so that main-task performance is unaffected.
  • Information bottleneck constraints: Preventing leakage of removed attributes via other correlated variables.

Formally, in representation learning, this leads to objectives of the form:

$$\min_{w,\theta}\; \lambda_1 I(h;x) \;-\; \lambda_2 I(h;y) \;+\; \lambda_3 I(h;z)$$

where $\lambda_1, \lambda_2, \lambda_3$ determine the trade-off between compression, utility, and attribute removal (Guo et al., 2022). In recommender systems, the attribute-unlearning objective aligns the distributions of user embeddings $e_u$ across attribute classes, e.g., by minimizing maximum mean discrepancy (MMD) or mutual information between $e_u$ and $z$, and regularizes against function-space drift for utility preservation (Chen et al., 2024).
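The PyTorch sketch below shows how such an objective can be assembled in practice, using tractable proxies rather than exact mutual-information terms: a task cross-entropy for the $I(h;y)$ term, a squared-norm penalty as a crude compression surrogate for $I(h;x)$, and a KL-to-uniform term on an attribute probe as the removal proxy. All module names, dimensions, and the choice of proxies are illustrative assumptions, not the construction of Guo et al. (2022).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical shapes: 32-dim inputs, 16-dim representation h,
# a 10-class main task y, and a binary sensitive attribute z.
encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
task_head = nn.Linear(16, 10)  # proxy for I(h; y): predict y from h
attr_head = nn.Linear(16, 2)   # probe for I(h; z): trained separately to predict z

lam1, lam2, lam3 = 1e-3, 1.0, 1.0  # compression / utility / removal weights

def unlearning_loss(x, y):
    h = encoder(x)
    # lam1 term: squared-norm penalty as a crude stand-in for compressing I(h; x).
    compress = h.pow(2).mean()
    # lam2 term: minimizing task cross-entropy maximizes a lower bound on I(h; y).
    utility = F.cross_entropy(task_head(h), y)
    # lam3 term: push the attribute probe's predictions toward uniform, so h
    # carries no usable signal about z (the probe itself is refit between
    # encoder updates with an ordinary cross-entropy on (h.detach(), z)).
    log_q = F.log_softmax(attr_head(h), dim=-1)
    uniform = torch.full_like(log_q, 1.0 / log_q.size(-1))
    removal = F.kl_div(log_q, uniform, reduction="batchmean")
    return lam1 * compress + lam2 * utility + lam3 * removal
```

In practice the encoder update and the attribute-probe refit alternate, mirroring the adversarial schemes discussed in §2.3.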

In LLMs, the goal is to remove knowledge traces associated with specific attributes or facts, e.g., by identifying and excising low-rank subspaces corresponding to the attribute, or by orthogonalizing away attribute-relevant directions within the parameter space (Lizzo et al., 2024).

2. Methodological Approaches

2.1 Subspace-based Unlearning in LLMs

UNLEARN targets attribute- or task-level knowledge by constructing low-rank “adapter” matrices $\{T_i^l\}$ specific to the target attribute $i$, for each Transformer layer $l$ (Lizzo et al., 2024). Forgetting is performed by the subtraction

$$\widehat{W} = W - T_i'$$

where $T_i'$ is a Gram–Schmidt-orthogonalized, discrimination-corrected version, ensuring that the removal does not impact similar subspaces associated with non-target attributes. This subspace can also be added back (“LEARN”) for targeted knowledge insertion.
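A schematic of the discrimination step, under the assumption that it amounts to projecting the target adapter off an orthonormal basis of the preserved adapters' row spaces; the tensor names and the row-wise Gram–Schmidt treatment are this sketch's assumptions, not the verbatim procedure of Lizzo et al. (2024):

```python
import torch

def discriminate(T_target: torch.Tensor, T_others: list[torch.Tensor]) -> torch.Tensor:
    """Strip from T_target the directions shared with adapters to be preserved.

    Adapters are treated as stacks of row vectors; Gram-Schmidt builds an
    orthonormal basis of the preserved subspaces, and T_target's rows are
    projected off that basis so subtraction cannot touch non-target knowledge.
    """
    basis = []
    for T in T_others:
        for v in T:                      # iterate over row vectors
            v = v.clone()
            for b in basis:              # Gram-Schmidt: remove known directions
                v = v - (v @ b) * b
            if v.norm() > 1e-8:          # keep only genuinely new directions
                basis.append(v / v.norm())
    T_prime = T_target.clone()
    for b in basis:                      # project target rows off the basis
        T_prime = T_prime - torch.outer(T_prime @ b, b)
    return T_prime

# Forgetting for one layer: W_hat = W - discriminate(T_i, other_adapters)
```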

2.2 Information-Theoretic and Distributional Methods

In representation learning for vision or tabular tasks, attribute-level unlearning leverages mutual information minimization. The infoFiltra objective enforces selective filtering at training time, directly minimizing $I(h;z)$ (attribute information) with tractable variational bounds, subject to bottleneck and utility-preservation constraints (Guo et al., 2022). FastFiltra provides an acceleration scheme for high-dimensional settings.
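One widely used tractable surrogate for minimizing $I(h;z)$ is a contrastive log-ratio upper bound (CLUB-style); whether infoFiltra uses this particular bound is not stated above, so the estimator below is a stand-in for “tractable variational bound”:

```python
import torch
import torch.nn.functional as F

def club_upper_bound(h: torch.Tensor, z: torch.Tensor,
                     q_net: torch.nn.Module) -> torch.Tensor:
    """CLUB-style upper bound on I(h; z) for a discrete attribute z.

    q_net approximates the conditional q(z | h) and is fit separately by
    maximum likelihood; given a good q, I(h; z) is upper-bounded by
    E_p(h,z)[log q(z|h)] - E_p(h)p(z)[log q(z|h)].
    """
    log_q = F.log_softmax(q_net(h), dim=-1)            # (B, num_classes)
    positive = log_q.gather(1, z.unsqueeze(1)).mean()  # matched (h_i, z_i) pairs
    negative = log_q[:, z].mean()                      # all (h_i, z_j) cross-pairs
    return positive - negative
```

The encoder treats this scalar as the $I(h;z)$ term in its loss, while `q_net` is refreshed with a standard NLL objective between encoder steps.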

For recommender systems, distinguishability losses based on MMD or user-to-user distances are applied post-training to the user embedding matrix, often combined with function-space (rank-list) regularization to limit utility degradation (Li et al., 2023, Chen et al., 2024). Multi-attribute extensions such as LEGO alternate per-attribute embedding calibration (parallelized mutual information minimization) with a flexible aggregation step, optimizing mixture coefficients to jointly remove all targeted attributes (Yu et al., 23 Oct 2025).
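A minimal sketch of the distribution-to-distribution idea: an RBF-kernel estimate of squared MMD between the embedding groups induced by a binary attribute, plus a mean-squared drift penalty standing in for the rank-list regularizer (function names, the kernel bandwidth, and the MSE stand-in are assumptions, not the exact losses of Li et al., 2023 or Chen et al., 2024):

```python
import torch

def mmd2_rbf(e_a: torch.Tensor, e_b: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Squared MMD between two embedding groups under an RBF kernel (biased estimate)."""
    def k(x, y):
        return torch.exp(-torch.cdist(x, y).pow(2) / (2 * sigma ** 2))
    return k(e_a, e_a).mean() + k(e_b, e_b).mean() - 2 * k(e_a, e_b).mean()

def post_training_au_loss(E, z, scores_new, scores_orig, beta=1.0):
    """Make user embeddings indistinguishable across attribute groups while
    anchoring predicted scores to the original model (utility preservation)."""
    distinguishability = mmd2_rbf(E[z == 0], E[z == 1])
    drift = (scores_new - scores_orig).pow(2).mean()  # function-space stand-in
    return distinguishability + beta * drift
```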

2.3 Adversarial and Gradient-based Approaches

Adversarial training (e.g., with a gradient reversal layer) is widely used for embedding-level attribute removal in both centralized and federated settings. FedAU² introduces adaptive per-user adversarial triggers and robust gradient obfuscation using dual-stochastic VAEs to impede gradient-based attribute reconstruction, which is crucial for federated recommenders with practical privacy guarantees (Li et al., 28 Nov 2025).
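The gradient reversal layer at the core of these schemes is a few lines of PyTorch: the forward pass is the identity, while the backward pass negates (and optionally scales) the gradient, so the attribute classifier above it trains normally while the encoder below is pushed to destroy attribute information:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam: float = 1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the sign of the incoming gradient; lam itself gets no gradient.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam: float = 1.0):
    return GradReverse.apply(x, lam)

# Usage: attr_logits = attr_head(grad_reverse(encoder(x), lam=0.5))
# Minimizing the attribute cross-entropy then maximizes it w.r.t. the encoder.
```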

2.4 Discrete Output-Space (MCQ) and Token-based LLM Unlearning

DF-MCQ constrains LLM unlearning to multiple-choice questions: it flattens the model's predictive distribution over attribute-specific MCQs via a KL divergence to the uniform distribution, so that no single option (including the true attribute) remains favored, which in turn triggers consistent refusal behavior (Sun et al., 5 May 2025). UniErase operates by learning a special unlearning token [UNL], optimized to route attribute queries to “I don't know”-style refusals, followed by local model edits that deterministically map attribute-specific prompts into the ignored region, modifying only a small fraction of weights for efficiency (Yu et al., 21 May 2025).
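Per question, the DF-MCQ flattening objective reduces to a KL divergence between the model's distribution over the C answer options and the uniform distribution. The sketch below assumes `option_logits` already holds the logits the LLM assigns to the C choice tokens; how those logits are extracted from a particular model is omitted:

```python
import torch
import torch.nn.functional as F

def flatten_loss(option_logits: torch.Tensor) -> torch.Tensor:
    """KL(uniform || p_model) over the C answer options of each MCQ.

    Driving this loss to zero makes all options equally likely, so no choice
    (including the true attribute value) is favored and the per-question
    entropy approaches its maximum, ln C.
    """
    log_p = F.log_softmax(option_logits, dim=-1)           # (B, C)
    uniform = torch.full_like(log_p, 1.0 / log_p.size(-1))
    return F.kl_div(log_p, uniform, reduction="batchmean")
```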

3. Evaluation Metrics and Empirical Benchmarks

Attribute-level unlearning efficacy is measured along several axes:

  • Forget efficacy: Relative accuracy drop or increased uncertainty on the target attribute/task/QA set (e.g., >96% accuracy loss on forgotten GSM8K task for UNLEARN (Lizzo et al., 2024), high MCQ entropy and refusal rate for DF-MCQ (Sun et al., 5 May 2025)).
  • Retention efficacy: Change in utility on non-target tasks or attribute sets; ideally, degradation is minimal (e.g., <2.5% drop in UNLEARN (Lizzo et al., 2024)).
  • Indistinguishability: Balanced accuracy or attack success rate of external attribute inference (targeting 0.5, i.e., random guessing, in the binary setting; achieved in PoT-AU (Li et al., 2023, Chen et al., 2024)). A minimal attacker check is sketched after this list.
  • Collaterality and selectivity: Norm-AUC and absolute AUC trade-off curves assess whether forgetting can be precisely targeted without unintended knowledge loss (Wu et al., 18 Jun 2025).
  • Computational efficiency: Number of gradient steps, parallelizability, and wall-clock time versus retraining or full RLHF (e.g., attribute-level unlearning can use ~2% of RLHF time (Yao et al., 2023)).
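For the indistinguishability axis, a minimal external-attacker check can be implemented with scikit-learn; this sketch uses a linear (logistic-regression) attacker only, whereas the cited evaluations also probe with non-linear attackers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

def attribute_inference_check(embeddings: np.ndarray, z: np.ndarray):
    """Train an attacker on post-unlearning embeddings to recover attribute z.

    For a binary attribute, balanced accuracy and AUC near 0.5 indicate the
    attribute is no longer (linearly) recoverable from the representations.
    """
    X_tr, X_te, z_tr, z_te = train_test_split(
        embeddings, z, test_size=0.3, stratify=z, random_state=0)
    attacker = LogisticRegression(max_iter=1000).fit(X_tr, z_tr)
    bacc = balanced_accuracy_score(z_te, attacker.predict(X_te))
    auc = roc_auc_score(z_te, attacker.predict_proba(X_te)[:, 1])
    return bacc, auc
```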

4. Empirical Findings and Best Practices

  • Trade-offs: There is a consistent trade-off between attribute removal and utility; strong removal parameters (e.g., larger adapter rank or lower regularization) yield better forgetting but risk degrading main-task performance or inducing leakage (Lizzo et al., 2024, Guo et al., 2022).
  • Learning-time encoding matters: Attribute representations that are paraphrased or explicitly separated during training are easier to unlearn (Wu et al., 18 Jun 2025). Highly entangled or chunked attributes suffer from collateral forgetting—even isolating a single fact is non-trivial if it is embedded within a dense paragraph.
  • Choice of loss and regularizer: In recommender settings, distribution-to-distribution MMD-based losses offer greater robustness than user-to-user distance methods and are effective against both linear and non-linear attribute inference attacks (Li et al., 2023, Chen et al., 2024).
  • Scalability and adaptability: Modular two-stage approaches (e.g., LEGO), as well as mechanisms for batching or parallelizing per-attribute removal steps, enable adaptation to dynamic or multi-attribute removal requirements with near-constant overhead (Yu et al., 23 Oct 2025).
  • Adversarial defense: Robustness to gradient inversion and query-level attribute recovery can be enhanced via injective stochasticity (dual-VAEs) and selective gradient reversal, particularly critical in federated contexts (Li et al., 28 Nov 2025).

5. Limitations and Open Challenges

  • Subspace overlap: If the target attribute subspace is largely contained within a protected or utility subspace, discrimination procedures cannot achieve perfect separation; some collateral loss is unavoidable (Lizzo et al., 2024).
  • Negative attributes and model collapse: Simple gradient-ascent unlearning risks global utility loss or outright collapse without regularization, especially for open-ended generators; careful balancing of loss coefficients is essential (Yao et al., 2023).
  • Scaling to many attributes: As the number of simultaneous attributes grows, targeted removal becomes more difficult. Aggregating attribute-unlearning jobs or optimizing flexible combination (as in LEGO) can partially mitigate this but may eventually lead to degraded indistinguishability or utility (Yu et al., 23 Oct 2025).
  • Evaluation and probing: Differentiating between true unlearning and obfuscation (which simply dilutes but does not remove knowledge) requires specialized probing frameworks (e.g., MCQs, refusal rates); nominal accuracy or entropy metrics may not reveal persistent shortcuts (Sun et al., 5 May 2025).

6. Extensions, Practical Guidelines, and Future Directions

  • Direct application to LLMs: Layerwise subspace unlearning, MCQ- or token-triggered model editing, and information-theoretic objectives have been concretely instantiated in LLMs (UNLEARN, DF-MCQ, UniErase), each offering different trade-offs in selectivity, efficiency, and interpretability (Lizzo et al., 2024, Sun et al., 5 May 2025, Yu et al., 21 May 2025).
  • Multiple-attribute and dynamic unlearning: Frameworks such as LEGO formalize multi-attribute unlearning as mutual information minimization within a local trust region, split into parallelizable calibration and flexible combination steps, yielding both theoretical guarantees and dynamic adaptability (Yu et al., 23 Oct 2025); a schematic of the two-stage pattern follows this list.
  • Best practices: Training-time paraphrasing and attribute isolation, careful selection of regularization scale, empirical tuning of forgetting-retaining trade-offs, and use of function-space loss for utility preservation are recurrently validated choices (Wu et al., 18 Jun 2025, Chen et al., 2024).
  • Open problems: Robust multi-attribute removal, formal convergence guarantees for adversarial federated unlearning, defense against increasingly powerful attack models, and integration into real-world privacy governance pipelines are active research frontiers.
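The two-stage pattern behind LEGO can be schematized as follows; the simplex parameterization of the mixture coefficients, the `leakage_fn` callback (e.g., the MMD term from §2.2), and all names are this sketch's assumptions rather than LEGO's published algorithm:

```python
import torch

def combine(E0, deltas, attrs, leakage_fn, steps=200, lr=0.05):
    """Stage 2 of a two-stage multi-attribute scheme (schematic).

    Stage 1 produced one calibrated embedding update per attribute, in
    parallel. Here mixture coefficients alpha are optimized so that
    E0 + sum_k alpha_k * deltas[k] minimizes total attribute leakage.
    """
    alpha = torch.zeros(len(deltas), requires_grad=True)
    opt = torch.optim.Adam([alpha], lr=lr)
    for _ in range(steps):
        weights = alpha.softmax(0)            # keep coefficients on a simplex
        E = E0 + sum(w * d for w, d in zip(weights, deltas))
        loss = sum(leakage_fn(E, z) for z in attrs)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return E0 + sum(w * d for w, d in zip(alpha.softmax(0), deltas))
```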

7. Representative Methods and Quantitative Results

| Method | Forgetting Efficacy (Target) | Retention (Utility) | Scalability | Domain |
|---|---|---|---|---|
| UNLEARN (Lizzo et al., 2024) | ~96% task accuracy drop | <2.5% accuracy drop | Linear in layer count | LLMs |
| FastFiltra (Guo et al., 2022) | 30–50% attribute classifier accuracy drop | <1% main-task loss | 3× faster, 40% less GPU | Vision, tabular |
| PoT-AU D2D (Li et al., 2023, Chen et al., 2024) | AUC ≈ 0.5 (random guess) | ≤2% NDCG/HR drop | 10× faster than retraining | Recommender |
| LEGO (Yu et al., 23 Oct 2025) | BAcc ≈ 28% (vs. 42% original) | NDCG@10 drop ≈ 1.16% | Constant in # attributes | Multi-attribute RecSys |
| UniErase (Yu et al., 21 May 2025) | FE ≈ 86.4, RE ≈ 73.6 | >75% QA accuracy retained | 3–4% weight change, 10× faster | Fine-grained LLM |
| DF-MCQ (Sun et al., 5 May 2025) | >90% refusal rate, MCQ entropy ≈ ln C | Retain set unchanged | Single GPU, minutes | LLM factual unlearning |
| FedAU² (Li et al., 28 Nov 2025) | BAcc ↓ 26.4% (gender) | NDCG@10 loss ≈ 4.5% | Small overhead, SUT stable | Federated recommender |

Each approach explicitly couples attribute erasure with quantitative control of utility and scalability. Selection among approaches is determined by model type, attribute structure, scale requirements, and practical privacy or compliance constraints.
