Representation Erasure in Machine Learning

Updated 6 April 2026

Representation erasure is the process of removing specific, sensitive, or harmful information from model representations while maintaining utility for unrelated tasks.
Techniques range from linear methods like orthogonal projections to advanced nonlinear strategies such as kernelized transformations and iterative adversarial minimization.
This approach supports fairness, privacy, and compliance by effectively balancing the trade-off between erasing sensitive attributes and retaining necessary information, despite challenges like dimensionality and hyperparameter sensitivity.

Representation erasure refers to the process of removing or neutralizing specific information—often semantic concepts, sensitive user data, or harmful affordances—from learned representations in neural networks or data-driven models. The objective is to ensure that the residual representation contains as little decodable information about the designated attribute (“concept”) as possible, while retaining maximum utility for all unrelated tasks. In modern research, representation erasure is a unifying principle for fairness (removal of protected attributes), privacy (machine unlearning), model interpretability, safety (harmful content suppression), and compliance (e.g., GDPR requests).

1. Formal Definitions and Principles

At its core, representation erasure can be precisely stated in information-theoretic terms. Let $X$ denote the original representation (e.g., hidden layer output, embedding), $C$ the concept/sensitive attribute to erase, and $Y$ the utility-relevant label. The canonical objective is:

$\max_f\quad I(Z;Y)\quad \text{s.t.}\quad I(Z;C)\leq \epsilon,\qquad Z = f(X)$

where $f$ is the erasure map, $I(\cdot\,;\cdot)$ denotes mutual information, and $\epsilon$ is a tolerance parameter, with $\epsilon = 0$ corresponding to perfect erasure (Chowdhury et al., 25 Mar 2025). A task-agnostic specialization sets $Y = X$ and maximizes the information retained about the original input subject to erasure of $C$. In machine unlearning (user data erasure), the “concept” $C$ is the set of examples to be forgotten, and the goal is for the resulting representation to carry as little information as possible about the erased data while remaining useful on the retained data (Wang et al., 27 Feb 2025).
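As a rough illustration of the two quantities in this objective, the sketch below computes a plug-in estimate of mutual information on a toy discrete example. The toy setup (an input $X = (c, u)$ made of one concept bit and one unrelated bit) and the function name `mutual_information` are assumptions made for illustration; they do not come from the cited works.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(a, b):
    """Plug-in estimate of I(A;B) in bits from paired discrete samples."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (n_ab / n) * log2(n_ab * n / (count_a[x] * count_b[y]))
        for (x, y), n_ab in count_ab.items()
    )


# Toy data (assumption): X = (c, u), c is the concept bit, u an unrelated bit.
xs = list(product([0, 1], repeat=2)) * 25  # uniform over the four values
cs = [x[0] for x in xs]

kept_all = xs                      # f(x) = x leaves the concept in place
dropped = [x[1] for x in xs]       # f(x) = u discards the concept coordinate

leak_identity = mutual_information(kept_all, cs)   # I(Z;C) without erasure
leak_dropped = mutual_information(dropped, cs)     # I(Z;C) after erasure
kept_dropped = mutual_information(dropped, xs)     # I(Z;X) after erasure
```

In this toy case the identity map leaks the full concept bit, whereas dropping the concept coordinate gives $I(Z;C) = 0$ while still retaining one bit about $X$. Plug-in estimates of this kind are only reliable for small discrete alphabets; continuous representations need other estimators.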

Guardedness generalizes this notion to a class of adversaries: $Z$ is said to guard $C$ against that class if no adversary in it can recover $C$ from $Z$ significantly better than chance (Belrose et al., 2023).

2. Algorithmic Strategies for Erasure

2.1 Linear Erasure

Linear erasure methods construct a (typically orthogonal) projection $P$ that removes all linearly decodable signal about the concept from $X$ (Ravfogel et al., 2022, Belrose et al., 2023). When the adversary class is restricted to linear maps, the problem becomes finding a projection that alters the representation as little as possible while ensuring that no linear predictor applied to $PX$ can recover $C$. LEACE gives a closed-form least-squares solution, an oblique projection defined in whitened space, that is simultaneously optimal for every norm induced by an inner product (Belrose et al., 2023). R-LACE frames the problem as a maximin game, providing closed-form or convex-relaxed projectors (Ravfogel et al., 2022).
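A minimal NumPy sketch of a whitened, cross-covariance-nulling projection in the spirit of LEACE is given below. It follows one reading of the published closed form and is not the reference implementation; the function names `fit_linear_eraser` and `erase`, the synthetic data, and the assumption of a full-rank covariance are all illustrative choices.

```python
import numpy as np


def fit_linear_eraser(X, C, tol=1e-10):
    """Fit (mu, M) so that r(x) = x - M (x - mu) is uncorrelated with C."""
    X = np.asarray(X, dtype=float)
    C = np.asarray(C, dtype=float).reshape(len(X), -1)
    mu = X.mean(axis=0)
    Xc, Cc = X - mu, C - C.mean(axis=0)
    n = len(X)
    sigma_xx = Xc.T @ Xc / n   # covariance of the representation
    sigma_xc = Xc.T @ Cc / n   # cross-covariance with the concept

    # Symmetric whitening W = sigma_xx^{-1/2} and its inverse
    # (assumption: sigma_xx is full rank, so all eigenvalues are positive).
    evals, evecs = np.linalg.eigh(sigma_xx)
    W = (evecs / np.sqrt(evals)) @ evecs.T
    W_inv = (evecs * np.sqrt(evals)) @ evecs.T

    # Orthogonal projector onto the column space of the whitened cross-covariance.
    U, s, _ = np.linalg.svd(W @ sigma_xc, full_matrices=False)
    U = U[:, s > tol]
    P = U @ U.T
    return mu, W_inv @ P @ W


def erase(X, mu, M):
    """Apply the fitted affine eraser row-wise."""
    return X - (np.asarray(X, dtype=float) - mu) @ M.T


# Synthetic check (assumption): a binary concept shifts three of five features.
rng = np.random.default_rng(0)
C = rng.integers(0, 2, size=(2000, 1)).astype(float)
X = rng.normal(size=(2000, 5)) + C * np.array([2.0, -1.0, 0.5, 0.0, 0.0])
mu, M = fit_linear_eraser(X, C)
Z = erase(X, mu, M)
cross_cov = (Z - Z.mean(axis=0)).T @ (C - C.mean(axis=0)) / len(Z)
```

After erasure the empirical cross-covariance between `Z` and `C` vanishes up to floating-point error, so a least-squares linear probe fitted on the same data has nothing to work with. Non-linear probes are not covered by this guarantee.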

2.2 Nonlinear Erasure

Linear projections can leave traces that non-linear probes are able to recover. Kernelized erasure extends the idea to a reproducing-kernel Hilbert space (RKHS), with the erasure operator acting as a projection or transformation in feature space so that no classifier in the associated RKHS can predict the concept (Ravfogel et al., 2022).

KRaM (Kernelized Rate-Distortion Maximizer) generalizes further: it learns a non-linear transformation $f$ by maximizing a kernelized rate-distortion objective that forces representations sharing the same concept label to become dissimilar in the learned space, while maintaining overall geometric alignment with the original space (Chowdhury et al., 2023).

The objective combines two terms: one rewards separation of concept-similar points, and the other constrains the overall information volume of the learned representation.

LEOPARD introduces non-linear, density-matching orthogonal projections using maximum mean discrepancy (MMD): the pushforward of all class-conditional densities is forced to become indistinguishable under a characteristic kernel (Saillenfest et al., 16 Jul 2025). Cascade procedures combine linear and non-linear removal.
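The discrepancy measure underlying this kind of density matching can be sketched as follows. This is only the standard biased estimator of squared MMD with an RBF kernel, not LEOPARD's training procedure; the function names, the bandwidth parameter `gamma`, and the synthetic samples are assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared MMD between samples X and Y."""
    return (
        rbf_kernel(X, X, gamma).mean()
        + rbf_kernel(Y, Y, gamma).mean()
        - 2.0 * rbf_kernel(X, Y, gamma).mean()
    )


# Synthetic class-conditional samples (assumption): two concept classes that
# differ by a mean shift, plus a second draw from the first class as a control.
rng = np.random.default_rng(0)
class_a = rng.normal(size=(300, 2))
class_a_again = rng.normal(size=(300, 2))
class_b = rng.normal(size=(300, 2)) + 2.0

gap_before = mmd2(class_a, class_b)          # classes are distinguishable
gap_control = mmd2(class_a, class_a_again)   # same distribution, small value
```

An erasure map trained against this measure would aim to drive the class-conditional gap toward the level of the same-distribution control. The result depends on the kernel bandwidth, which is one source of the hyperparameter sensitivity common to kernel methods.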

Obliviator formalizes complete independence against all non-linear adversaries, using iterative HSIC minimization in characteristic RKHSs, supported by a morphing/approximation algorithm that quantifies the erasure-utility cost curve (Akbari et al., 8 Mar 2026).
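For reference, the dependence measure named here can be estimated as in the sketch below, which uses the standard biased empirical HSIC estimator $\mathrm{tr}(KHLH)/(n-1)^2$ with RBF kernels. It illustrates the quantity being minimized, not Obliviator's algorithm; the function names, the bandwidth `gamma`, and the toy data are assumptions.

```python
import numpy as np


def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def hsic(Z, C, gamma=1.0):
    """Biased empirical HSIC between paired samples Z and C."""
    n = Z.shape[0]
    K = rbf_kernel(Z, Z, gamma)
    L = rbf_kernel(C, C, gamma)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2


# Toy data (assumption): the concept depends on z only through z**2, so the
# dependence is purely non-linear; c_independent is an unrelated control.
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))
c_nonlinear = z ** 2 + 0.1 * rng.normal(size=(200, 1))
c_independent = rng.normal(size=(200, 1))

dependence = hsic(z, c_nonlinear)
baseline = hsic(z, c_independent)
```

The toy data is chosen so that the dependence is carried by $z^2$ rather than by a linear trend, which is the kind of signal a purely linear eraser would leave in place but a kernel dependence measure still registers.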

2.3 Concept Erasure in Structured and Unsupervised Contexts

AMSAL solves for alignment and joint projections when the attribute to be erased is not aligned at the instance level but only observed via aggregate statistics or group centroids, using a hard EM-like optimization alternating between assignment and spectral maximization (Shao et al., 2023). CURE applies a clustering-based, unsupervised framework in the unlearning context to facial recognition, relying on centroid-guided pseudo-labels and margin-based losses when explicit identity labels are unavailable (Shivam et al., 23 Sep 2025).

3. Information-Theoretic Limits, Trade-offs, and Guarantees

The fundamental limits of erasure are characterized by information bottleneck theory and data-processing inequalities (Chowdhury et al., 25 Mar 2025). Under the constraint $I(Z;C) = 0$, the retained utility for $X$ is bounded as $I(Z;X) \leq H(X \mid C)$; that is, only the conditional entropy given the concept survives. Perfect erasure, $I(Z;C) = 0$, is achievable if the concept and non-concept supports are suitably disjoint or permutation-equivalent, and the optimal function $f^*$ is then a permutation or coupling that ensures $P(Z \mid C = c) = P(Z)$ for all $c$ (Chowdhury et al., 25 Mar 2025).
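A conditional-entropy bound of the form $I(Z;X) \leq H(X \mid C)$ can be checked in a few steps for discrete variables, assuming a deterministic map $Z = f(X)$ and perfect erasure $I(Z;C) = 0$. This is a sketch of the standard argument and not necessarily the proof given in the cited work:

$\begin{aligned} I(Z;X) &= H(Z) - H(Z \mid X) = H(Z) && \text{since } Z = f(X) \\ &= H(Z \mid C) && \text{since } I(Z;C) = 0 \\ &\leq H(X \mid C) && \text{since } Z \text{ is a function of } X \end{aligned}$

The last step uses the fact that applying a deterministic function cannot increase conditional entropy. Equality requires $f$ to be injective on the support of $X$ within each value of $C$, which is where permutation-like or coupling-like maps enter.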

Erasure-utility trade-offs are captured in the "erasure funnel": utility is preserved up to the conditional-entropy limit $H(X \mid C)$ without any concept leakage, while further increases in utility require a sacrifice in erasure (i.e., a nonzero $\epsilon$). The cost of nonlinear guardedness, meaning how much utility is lost for complete independence, is empirically mapped for the first time by algorithms such as Obliviator (Akbari et al., 8 Mar 2026).

4. Representation Erasure in Deep Model Editing and Machine Unlearning

Representation erasure plays a central role in modern model editing and unlearning. In machine unlearning, CRFU (Compressive Representation Forgetting Unlearning) leverages the information bottleneck to compress representations and then minimally perturbs them to erase a specified subset of the training data, explicitly balancing a mutual information-based forgetting loss against a remembering constraint, modulated by an "unlearning rate" hyperparameter (Wang et al., 27 Feb 2025). On MNIST, CRFU raises reconstruction MSE against known membership attacks by 200% with only a small drop in accuracy.

For LLMs, adversarially invariant feature learning underpins frontend approaches, but REPO (Representation Erasure-based Preference Optimization) extends erasure to the entire sequence via token-level, domain-adversarial mechanisms, forcing features of dispreferred outputs (e.g., toxic content) to match their benign analogs and using anchoring to preserve general language capability (Sepahvand et al., 24 Feb 2026). In knowledge forgetting, KIF (Knowledge Immunization Framework) targets the internal activation signatures, distinguishing genuine erasure (removal of latent signatures) from surface-level obfuscation and achieving near-oracle forgetting (FQ ≈ 0.99) while matching utility to upper bounds (MU ≈ 0.62) (Mahmood et al., 15 Jan 2026).

Generalizing further, perfect erasure functions (PEFs) as in (Chowdhury et al., 25 Mar 2025) are information-theoretically optimal, but may require access to full conditional distributions and can be computationally intractable in high dimensions.

5. Applications in Generative Models and Information Retrieval

Representation erasure is fundamental to modern concept erasure in diffusion models and generative image architectures, where the need is to suppress copyrighted or harmful visual concepts upon demand. Modular, scalable approaches such as DyME dynamically compose orthogonal LoRA adapters only for the requested erased concepts per prompt, enforcing bi-level feature and parameter orthogonality for robust, multi-concept erasure (Liu et al., 25 Sep 2025). EraseAnything++ frames erasure as a multi-objective constrained optimization solved by implicit gradient surgery and LoRA parameter tuning, extending to long-horizon video and transformer architectures (Fan et al., 1 Mar 2026). Prototype-Guided Erasure formulates broad (e.g., "sexual", "violent") concept removal via clustering in CLIP space and negative guidance at inference, generalizing training-free techniques for image/text prompts (Cai et al., 9 Mar 2026).

In information retrieval, erasure acquires a quantum-theoretic interpretation, where projector-like erasure operators correspond to lexical proximity or presence measurements in a tokenized Hilbert space; composite queries and logical constraints are expressed as products and combinations of commuting and non-commuting erasers (0802.1738).

6. Experimental Evaluation and Limitations

Evaluation protocols for representation erasure span adversarial accuracy (e.g., a logistic or nonlinear probe before and after erasure), mutual information or MMD metrics, kNN overlap for alignment, fairness audits (TPR-Gap, demographic parity), inverse reconstruction (MSE), and synthetic gold-standard benchmarks (e.g., TOFU Forget10 in LLMs (Mahmood et al., 15 Jan 2026)). Typical results show that advanced nonlinear and information-theoretic methods reduce adversarial probe accuracy to chance with minor downstream utility loss; linear approaches, though interpretable, fail under nonlinear probing (Saillenfest et al., 16 Jul 2025, Akbari et al., 8 Mar 2026).
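Two of the fairness audits listed above can be sketched for the simplest case of a binary label and a binary protected group. The function names and the toy predictions are assumptions; published evaluations often use multi-class variants, for example an aggregate of per-class TPR gaps.

```python
import numpy as np


def tpr_gap(y_true, y_pred, group):
    """Absolute difference in true-positive rate between groups 0 and 1.

    Assumes each group contains at least one positive example.
    """
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    rates = []
    for g in (0, 1):
        positives = (group == g) & (y_true == 1)
        rates.append(y_pred[positives].mean())
    return abs(rates[0] - rates[1])


def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rate between groups 0 and 1."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())


# Toy predictions (assumption): four examples per group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]

gap_tpr = tpr_gap(y_true, y_pred, group)
gap_dp = demographic_parity_gap(y_pred, group)
```

In this toy case both groups receive positive predictions at the same rate, so the demographic parity gap is zero, yet the true-positive rates differ (1.0 versus 0.5). This is why the two audits are usually reported together, before and after erasure.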

Limitations include the curse of dimensionality in density estimation for PEFs, hyperparameter sensitivity in kernelized and modular approaches, and tight entanglement between main and protected attributes (causing unavoidable utility loss) (Shao et al., 2023). Perfect independence (zero mutual information) is sometimes information-theoretically unattainable, as characterized by principal inertia components and data-processing inequalities (Chowdhury et al., 25 Mar 2025).

7. Significance, Open Problems, and Future Directions

Representation erasure is now integral to database privacy, fairness-aware ML, neural model interpretability, selective unlearning, safety alignment, and controllable generation. Open problems include achieving robust, computationally efficient, universal non-linear guardedness (erasure against all adversaries), scalable density estimation for perfect erasure in high dimensions, streaming and continual erasure, and modular, dynamically composable forgetting mechanisms in large-scale multimodal and continual learning systems. The field continues to evolve, developing theoretically guided methodologies that reconcile practical constraints with provable privacy, safety, and fairness guarantees.