Info-Theoretic Framework for Attribute Unlearning

Updated 6 January 2026

The paper introduces an information-theoretic objective that minimizes mutual information between learned representations and sensitive attributes while preserving task-relevant utility.
It employs surrogate losses, such as variational bounds, MMD, adversarial and contrastive methods, to efficiently estimate and control the attribute's statistical footprint.
Practical schemes like LEGO and MaSS demonstrate strong empirical trade-offs, significantly reducing attribute inference without substantial loss in main-task performance.

Selective removal of sensitive or nuisance attributes from learned representations, known as attribute unlearning, has emerged as a fundamental requirement for privacy, fairness, and compliance in machine learning systems. The information-theoretic framework for attribute unlearning formalizes this objective as the selective reduction of statistical dependence—typically measured via Shannon mutual information—between target representations and the attributes to be forgotten, while ensuring maximal preservation of utility-relevant information. This paradigm provides mathematically rigorous objectives, optimization schemes, and guarantees, enabling principled design of algorithms that operate in settings as diverse as recommender systems, federated models, deep feature spaces, and multi-modal neural architectures.

1. Formal Information-Theoretic Objectives for Attribute Unlearning

Attribute unlearning is cast as an optimization over the encoder or data transformation parameters to maximize the retained information about task-relevant signals and inputs, while minimizing the mutual information with sensitive attributes. Let $x\in \mathcal X$ denote the original input, $y\in \mathcal Y$ the main task label, $z\in \mathcal Z$ the attribute to be unlearned, and $h = f_\theta(x)$ the learned representation.

The canonical information-theoretic objective is: $\max_\theta\ \mathcal U(h, y) \quad \text{subject to } I(h; z) \leq \varepsilon$ where $\mathcal U$ is a utility functional such as $I(h; y)$ (task-fidelity) or $I(h; x)$ (input preservation), and $I(h; z)$ quantifies the attribute's residual footprint. In soft-constraint form, this becomes: $\max_{\theta}\ \mathcal U(h, y)\ -\ \gamma I(h; z)$ for a Lagrange parameter $\gamma\geq 0$ (Guo et al., 2022, Xu et al., 8 Feb 2025).

In frameworks supporting multi-attribute unlearning (e.g., MaSS), the problem further generalizes to: $\begin{aligned} \text{maximize}_{\theta,\eta} \quad & I(X'; F)\ \text{subject to} \quad & I(X'; S_i) \leq m_i\ \forall i\ & I(X'; U_j) \geq n_j\ \forall j \end{aligned}$ where $X'$ is the transformed sample, $\{S_i\}$ are sensitive attributes, $\{U_j\}$ utility attributes, $F$ denotes unannotated generic information, and $m_i, n_j$ are per-attribute budget parameters (Chen et al., 2024). Attribute unlearning in federated or distributed models defines analogous criteria with respect to the information Fisher-score of model parameters about attributes or client data (Balordi et al., 26 Aug 2025).

2. Surrogate Losses and Mutual Information Estimation

Direct estimation of mutual information between high-dimensional representations and discrete/categorical attributes is generally intractable. Practical frameworks employ variational upper bounds or proxy divergences.

Upper bounds via Variational Classifiers: The vCLUB bound, for attribute $A_t$ and embedding $\theta$ , is (Yu et al., 23 Oct 2025):

$I(\theta; A_t) \leq \mathbf{E}_{p(\theta, a)} [\log q_\phi(a | \theta)] - \mathbf{E}_{p(\theta)} \mathbf{E}_{p(a)} [\log q_\phi(a | \theta)]$

for a learned variational classifier $q_\phi$ .

Distributional Divergence (MMD): In kernel-based approaches (e.g., post-training attribute unlearning), distinguishability is measured by the squared Maximum Mean Discrepancy between attribute-conditional embedding distributions (Chen et al., 2024, Li et al., 2023):

$\mathrm{MMD}^2(\mathbb{P}_1, \mathbb{P}_2) = \|\mu(\mathbb{P}_1) - \mu(\mathbb{P}_2)\|^2_{\mathcal{G}}$

where $\mathbb{P}_i$ is the distribution of embeddings conditioned on attribute class $i$ , and $\mathcal{G}$ is an RKHS.

Adversarial and Contrastive Surrogates: Adversarial classifiers (cross-entropy losses) and contrastive InfoNCE-type objectives are used to penalize mutual information with certain attributes while maximizing predictive information for utility attributes (Chen et al., 2024).
Jacobian-Norm Minimization: In local information-gain frameworks, the squared norm of the network Jacobian along the attribute axes is minimized, suppressing the transfer of information about the forgotten subspace (Foster et al., 2024):

$L_{\mathrm{attr}}(x; \theta) = \frac{1}{N} \sum \frac{\|f_\theta(x_a, x_u) - f_\theta(x_a + \delta_a, x_u)\|_2^2}{\|\delta_a\|_2^2}$

serving as a first-order proxy for $I(x_a; f_\theta(x) | x_u)$ .

3. Unified Optimization and Algorithmic Strategies

Multiple algorithmic instantiations arise for optimizing surrogate losses subject to information constraints:

Training-phase MI regularization: MI-based regularizers are integrated into the representation learning process, with closed-form bounds (e.g., InfoFiltra) supporting stepwise surrogate minimization (Guo et al., 2022).
Post-training Editing: Models can be edited post hoc, by optimizing only the embedding matrix or feature layers using distributional or adversarial losses, combined with functional/parameter-space regularization to protect utility (Li et al., 2023, Chen et al., 2024).
Two-step Procedures and Combinatorial Unlearning: Multi-attribute settings exploit parallelizable attribute-wise calibration steps, followed by a convex combination search to align embeddings to minimize information leakage on all targeted attributes (Yu et al., 23 Oct 2025).

A summary of MI estimation and surrogate tasks appears below:

MI estimation	Optimization phase	Approach
vCLUB bound	Training/post-train	Variational MI
MMD (& barycenter)	Post-train	RKHS divergence
Adversarial CE	Training/post-train	Classifier-based
InfoNCE contrastive	Both	Mutual info proxy
Jacobian norm	Post-train	Output sensitivity

4. Theoretical Guarantees and Trade-off Bounds

Rigorous bounds characterize Pareto trade-offs between utility retention and attribute removal:

Surrogate Loss Tightness: For InfoFiltra, it is shown that the optimized upper-bound loss $\mathcal{L}_{\mathrm{infoFiltra}}$ satisfies (Guo et al., 2022):

$\mathcal{L}_{\mathrm{infoFiltra}} \leq -\lambda_1 I(h; x) - \lambda_2 I(h; y) - \lambda_3 I(h; z)$

and under additional conditions, the slack $\epsilon$ is controlled by:

$\epsilon \leq \alpha\beta [I(x; y) - I(h; y) - I(h; z)]$

Feasibility Bounds: In selective suppression with utility lower bounds and sensitive-attribute upper bounds, the existence of a feasible mapping requires (Chen et al., 2024):

$n_j \leq m_i + I(X; U_j | S_i),\quad n_j \leq I(X; U_j),\quad m_i \geq 0$

and for unannotated utility, the achievable retained information is upper-bounded by $H(X|S_i) + m_i$ .

Distributional Pareto Frontier: In distributional unlearning, the Gaussian Pareto frontier of minimum deletions versus preservation is analytically given by (Allouah et al., 20 Jul 2025):

$\mathrm{PF}(P_U, P_R; \mathcal P) = \left\{ \left(\alpha, (\sqrt{\alpha} - \sqrt{D})^2 \right): \alpha \geq D \right\}$

with $D = D_{KL}(P_U \| P_R)$ .

Federated Unlearning: Per-parameter Target Information Score (TIS) quantifies the impact of unlearning:

$\mathrm{TIS}(\theta_i) = \left( \frac{\partial_{\theta_i}^2 \mathcal{L}(D_T)}{\partial_{\theta_i}^2 \mathcal{L}(D)} \right)^2$

Resetting high-TIS parameters followed by minimal retraining provably increases the adversary's error in inferring attribute presence (Balordi et al., 26 Aug 2025).

5. Practical Schemes and Empirical Evidence

Frameworks instantiate the above theory in various modalities and tasks:

LEGO: Employs parallelizable embedding calibration and a flexible convex-combination layer to support simultaneous/dynamic multi-attribute removal. It minimizes variational MI upper bounds subject to $ℓ_2$ -proximity constraints, with theoretical approximation-ratio guarantees. Empirically, LEGO achieves a $~24$ point drop in attribute-inference attacker BAcc and $<4\%$ NDCG@10 loss (Yu et al., 23 Oct 2025).
MaSS: Introduces adversarial cross-entropy on labeled attributes, contrastive InfoNCE for unannotated utility, and normalizes performance via "Normalized Accuracy Gain"; it outperforms alternatives across audio, image, and sensor datasets (Chen et al., 2024).
PoT-AU/D2D-FR: Distributional metric unlearning using MMD for class separation enables post hoc attribute removal at high efficiency (seconds vs retrain), with attacker BAcc reduced to chance levels and only $1$– $2\%$ drop in HR@10/NDCG@10 (Chen et al., 2024, Li et al., 2023).
Jacobian Smoothing: Just-in-Time attribute unlearning minimizes local information gain along attribute axes, yielding fast, output-level invariance to target attributes (Foster et al., 2024).
Wasserstein Barycenter: Feature-unlearning via optimal transport of attribute-conditional distributions to the barycenter gives a unified, closed-form solution for multiple objectives and admits efficient computation with neural OT maps (Xu et al., 8 Feb 2025).
Federated Unlearning: Hessian-diagonal scoring and targeted re-initialization generalizes attribute unlearning to federated and distributed settings (Balordi et al., 26 Aug 2025).

Empirical evaluations consistently show strong reduction in attribute inference, little degradation to main task metrics, and vastly improved computational and deletion efficiency relative to blind retraining or adversarial in-training unlearning.

6. Assessment, Metrics, and Generalization

Mutual-information–based metrics serve both as unlearning objectives and post hoc assessment tools:

Information Difference Index (IDI): Quantifies retained mutual information about forgotten attributes in intermediate layers; $IDI = 0$ indicates full erasure, $IDI=1$ no removal (Jeon et al., 2024). IDI leverages InfoNCE-based estimation and robustly separates truly unlearned models from those only masking outputs.
Trade-off analyses: Many frameworks present empirical utility–privacy Pareto frontiers, exhibiting minimal main-task loss up to near-complete attribute suppression (Yu et al., 23 Oct 2025, Chen et al., 2024, Allouah et al., 20 Jul 2025).
Extension to Arbitrary Attributes: All frameworks support multi-class, continuous, and even arbitrarily structured attribute unlearning by adapting the MI estimators, proxy losses, and regularization strengths accordingly (Chen et al., 2024, Yu et al., 23 Oct 2025, Xu et al., 8 Feb 2025).

7. Open Problems and Significance

While information-theoretic frameworks for attribute unlearning deliver closed-form objectives, surrogate bounds, and post hoc metrics, some limitations persist. These include the computational cost of high-dimensional MI estimation, the explicit-knowledge assumption about sensitive attributes for labelled suppression, and the non-convexity of deep representation mapping spaces. Nonetheless, these frameworks supply the first rigorous foundations for selective attribute removal, unifying disparate strategies under the powerful machinery of mutual information, and enabling efficient, theoretically justified implementations across a range of modern machine learning architectures (Guo et al., 2022, Yu et al., 23 Oct 2025, Chen et al., 2024, Chen et al., 2024, Xu et al., 8 Feb 2025, Allouah et al., 20 Jul 2025, Jeon et al., 2024, Balordi et al., 26 Aug 2025).