
AGO Loss for Fine-Grained Unlearning

Updated 6 February 2026
  • Activation-Guided Orthogonal (AGO) Loss is a machine unlearning method that employs layer-local optimization to balance targeted forgetting with minimal performance loss.
  • It uses mutual information and PCA to select the least-entangled layer and combines contrastive unalignment with retention alignment using orthogonal gradient projection.
  • Empirical evaluations on LLMs show that AGO Loss effectively reduces sensitive knowledge while preserving key retained capabilities with minimal interference.

The Activation-Guided Orthogonal (AGO) Loss is a central mechanism for fine-grained machine unlearning introduced in the FALCON framework—Fine-grained Activation manipuLation by Contrastive Orthogonal uNalignment—for LLMs. Designed to address safety concerns arising from the inadvertent encoding of sensitive or harmful information, AGO Loss departs from prior coarse-grained approaches by operating at the resolution of individual network layers identified through mutual-information metrics. AGO Loss orchestrates the delicate balance between effective removal (unlearning) of targeted knowledge and rigid preservation of essential retained knowledge by integrating representation-guided parameter selection, contrastive gradient signals, and a principled orthogonalization of conflicting objectives in optimization (Hu et al., 3 Feb 2025).

1. Formal Definition of AGO Loss

AGO Loss operates at a single, strategically selected layer $l^*$, parameterized by $\theta^{l^*}$. Rather than minimizing a naive weighted sum of two competing objectives (a contrastive unalignment loss for the “forget” set, $\mathcal{L}_F$, and a retention alignment loss for the “retain” set, $\mathcal{L}_R$), AGO Loss interleaves their opposing gradients using orthogonal projection to approximate Pareto-optimal model updates. The update direction for $\theta^{l^*}$ is given by:

$$\nabla \theta^{l^*} = \alpha \cdot \mathrm{Proj}_{\perp \nabla \mathcal{L}_R}\!\left(\nabla \mathcal{L}_F\right) + \beta \cdot \nabla \mathcal{L}_R$$

where $\mathrm{Proj}_{\perp \nabla \mathcal{L}_R}(\nabla \mathcal{L}_F)$ projects the unlearning gradient orthogonally to the retention gradient when their cosine similarity is negative, suppressing destructive interference [(Hu et al., 3 Feb 2025), Eqn. (1)]. Scalars $\alpha$ and $\beta$ weight unlearning and retention, respectively, and are dynamically adjusted to favor retention under high conflict.

2. Activation-Based Parameter Selection

AGO Loss is fundamentally layer-local. FALCON first identifies the “least-entangled” layer by computing mutual information (MI) between activation distributions of the forget and retain sets at each layer $l$. Explicitly, let $A_F^l$ and $A_R^l$ denote activations for the forget and retain sets. Entropies $H(A_F^l)$, $H(A_R^l)$, and joint entropy $H(A_F^l, A_R^l)$ are estimated by kernel density estimation (KDE) with a Gaussian kernel and dimensionality reduction via PCA (retaining 95% variance). MI is calculated as:

$$I(A_F^l; A_R^l) = H(A_F^l) + H(A_R^l) - H(A_F^l, A_R^l)$$

The target layer $l^*$ for AGO Loss optimization is selected by minimizing $I(A_F^l; A_R^l)$ across layers:

$$l^* = \arg\min_{l}\, I(A_F^l; A_R^l)$$

Empirically, early layers usually exhibit minimal MI, indicating greater disentanglement between forget and retain knowledge. This selective focus enables “surgical” parameter updates with reduced collateral utility degradation (Hu et al., 3 Feb 2025).
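As a sketch, the layer-selection step might look like the following; the PCA reduction, the Gaussian-KDE entropy estimator, and all function names are illustrative stand-ins for the paper's procedure, not its exact implementation:

```python
import numpy as np
from scipy.stats import gaussian_kde

def pca_reduce(X, var_ratio=0.95):
    """Project rows of X onto the top principal components retaining
    `var_ratio` of the variance (the paper's 95% setting)."""
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(S**2) / np.sum(S**2)
    k = int(np.searchsorted(explained, var_ratio)) + 1
    return Xc @ Vt[:k].T

def kde_entropy(X):
    """Monte-Carlo entropy estimate H(X) ~ -E[log p(X)] under a Gaussian KDE."""
    kde = gaussian_kde(X.T)
    return float(-np.mean(np.log(kde(X.T) + 1e-300)))

def mutual_information(A_f, A_r):
    """I(A_F; A_R) = H(A_F) + H(A_R) - H(A_F, A_R), all KDE-estimated."""
    return kde_entropy(A_f) + kde_entropy(A_r) - kde_entropy(np.hstack([A_f, A_r]))

def select_layer(forget_acts, retain_acts):
    """Return the index of the least-entangled (minimal-MI) layer."""
    scores = [mutual_information(pca_reduce(A_f), pca_reduce(A_r))
              for A_f, A_r in zip(forget_acts, retain_acts)]
    return int(np.argmin(scores)), scores
```

In practice `forget_acts`/`retain_acts` would be hidden-state matrices collected per layer from the LLM; here any list of sample-by-feature arrays works.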

3. Contrastive Unalignment and Retention Terms

Contrastive Unalignment Loss ($\mathcal{L}_F$)

At the identified layer $l^*$, AGO Loss employs a contrastive mechanism to enforce separation between updated forget-set activations and their original principal subspaces. For each batch:

  • Obtain forget activations from the updated model ($a_F$) and from a frozen reference model ($\tilde{a}_F$).
  • Extract the top-$k$ principal directions via SVD of the frozen forget activations (the first $k$ columns $V_k$ of the right-singular matrix $V$).
  • Generate a “principal offset” vector $o$ by pushing a random seed away from these subspaces:

$$o = g\!\left(\lambda\,(I - V_k V_k^\top)\, r + \epsilon\right)$$

where $r$ is a random seed vector, $\lambda$ controls the offset magnitude, $\epsilon$ is optional noise, and $g$ is a projection or nonlinearity.

  • Compute cosine similarities between anchor (updated activation), positive (offset), and negatives (other frozen forget activations). The loss uses InfoNCE:

$$\mathcal{L}_F = -\log \frac{\exp\!\left(\mathrm{sim}(a_F, o)/\tau\right)}{\exp\!\left(\mathrm{sim}(a_F, o)/\tau\right) + \sum_{j} \exp\!\left(\mathrm{sim}(a_F, \tilde{a}_{F,j})/\tau\right)}$$

with temperature $\tau$ set empirically [(Hu et al., 3 Feb 2025), Eqn. (7)].
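The offset construction and the contrastive term can be sketched in NumPy; the exact offset formula, the defaults for `k`, `lam`, and `tau`, and the function names are assumptions for illustration:

```python
import numpy as np

def _unit(v, eps=1e-12):
    return v / (np.linalg.norm(v) + eps)

def principal_offset(frozen_acts, rng, k=4, lam=1.0):
    """Build an offset direction by pushing a random seed vector out of
    the top-k principal subspace of the frozen forget activations."""
    Xc = frozen_acts - frozen_acts.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Vk = Vt[:k].T                       # (d, k) top-k principal directions
    r = rng.normal(size=frozen_acts.shape[1])
    o = lam * (r - Vk @ (Vk.T @ r))     # remove the principal-subspace component
    return _unit(o)                     # normalization plays the role of g(.)

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE over cosine similarities: pull the updated activation
    toward the offset, away from frozen forget activations."""
    a = _unit(anchor)
    sims = np.array([a @ _unit(positive)] + [a @ _unit(n) for n in negatives])
    logits = sims / tau
    logits -= logits.max()              # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))
```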

Retention Alignment Loss ($\mathcal{L}_R$)

To preserve retained knowledge, AGO Loss imposes a cosine alignment constraint between updated and frozen retain-set activations:

$$\mathcal{L}_R = \frac{1}{|B_R|} \sum_{i=1}^{|B_R|} \left(1 - \cos\!\left(a_{R,i}, \tilde{a}_{R,i}\right)\right)$$

This self-supervised retention term curtails drift on the retain set [(Hu et al., 3 Feb 2025), Eqn. (8)].
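A minimal sketch of this retention term, assuming row-wise paired updated and frozen activations:

```python
import numpy as np

def retention_alignment_loss(updated_acts, frozen_acts, eps=1e-12):
    """Mean (1 - cosine similarity) between paired updated and frozen
    retain-set activations; zero when the two are perfectly aligned."""
    num = np.sum(updated_acts * frozen_acts, axis=1)
    den = (np.linalg.norm(updated_acts, axis=1)
           * np.linalg.norm(frozen_acts, axis=1)) + eps
    return float(np.mean(1.0 - num / den))
```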

4. Orthogonal Projection and Gradient Conflict Resolution

The competing objectives inherent in machine unlearning—maximal forgetting while retaining existing utility—can manifest as conflicting gradients during joint optimization. AGO Loss explicitly identifies and mitigates such conflict:

  • Gradients of $\mathcal{L}_F$ ($\nabla \mathcal{L}_F$) and $\mathcal{L}_R$ ($\nabla \mathcal{L}_R$) are computed at $\theta^{l^*}$.
  • The cosine similarity $\cos\!\left(\nabla \mathcal{L}_F, \nabla \mathcal{L}_R\right)$ flags potential conflict (negative value).
  • If a conflict is detected ($\cos\!\left(\nabla \mathcal{L}_F, \nabla \mathcal{L}_R\right) < 0$), the unlearning gradient is projected onto the orthogonal complement of the retention gradient:

$$\mathrm{Proj}_{\perp \nabla \mathcal{L}_R}\!\left(\nabla \mathcal{L}_F\right) = \nabla \mathcal{L}_F - \frac{\langle \nabla \mathcal{L}_F, \nabla \mathcal{L}_R \rangle}{\lVert \nabla \mathcal{L}_R \rVert^2}\, \nabla \mathcal{L}_R$$

  • The final update is a weighted sum:

$$\nabla \theta^{l^*} = \alpha \cdot \mathrm{Proj}_{\perp \nabla \mathcal{L}_R}\!\left(\nabla \mathcal{L}_F\right) + \beta \cdot \nabla \mathcal{L}_R$$

with $\alpha$ and $\beta$ set adaptively depending on whether a conflict is detected, prioritizing retention in the event of strong conflict [(Hu et al., 3 Feb 2025), Eqns. (10)-(11)].
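The conflict-resolution rule reduces to a few lines; the specific `alpha`/`beta` schedule under conflict is an assumption, as the paper's exact weighting is not reproduced here:

```python
import numpy as np

def ago_update_direction(g_f, g_r, alpha=1.0, beta=1.0, conflict_beta=2.0):
    """Combine unlearning (g_f) and retention (g_r) gradients.
    On conflict (negative cosine, i.e. negative dot product), project g_f
    onto the orthogonal complement of g_r and up-weight retention."""
    if g_f @ g_r < 0:                                        # destructive interference
        g_f = g_f - (g_f @ g_r) / (g_r @ g_r + 1e-12) * g_r  # orthogonal projection
        beta = conflict_beta                                 # favor retention
    return alpha * g_f + beta * g_r
```

After projection, stepping along `g_f` no longer increases the retention loss to first order, which is the point of the orthogonalization.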

5. Integration in FALCON Framework and Training Procedure

AGO Loss is integrated within FALCON’s end-to-end unlearning regime as follows:

  • For every layer $l$, the mutual information $I(A_F^l; A_R^l)$ is computed via KDE and PCA on forget- and retain-set activations.
  • The layer $l^*$ with minimal MI is selected; all other parameters are frozen.
  • Updates are performed solely on $\theta^{l^*}$ using a second-order optimizer (e.g., Sophia).
  • Each step: batch forget/retain examples, compute the respective activations, evaluate $\mathcal{L}_F$ (via SVD and InfoNCE), evaluate $\mathcal{L}_R$, compute and project gradients as necessary, and apply the update.
  • Hyperparameters include task-specific settings for question answering, the total number of update steps, the number $k$ of principal vectors, and standard Sophia optimizer settings (Hu et al., 3 Feb 2025).
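The per-step procedure can be put together as a toy loop; plain gradient descent stands in for the Sophia optimizer, and the gradient callables and conflict-weighting schedule are hypothetical:

```python
import numpy as np

def falcon_unlearn_step(theta, grad_forget, grad_retain, lr=1e-2,
                        alpha=1.0, beta=1.0, conflict_beta=2.0):
    """One AGO update on the selected layer's parameters `theta`.
    `grad_forget`/`grad_retain` return the gradients of L_F and L_R
    at theta."""
    g_f, g_r = grad_forget(theta), grad_retain(theta)
    if g_f @ g_r < 0:                                        # conflict: orthogonalize
        g_f = g_f - (g_f @ g_r) / (g_r @ g_r + 1e-12) * g_r
        beta = conflict_beta                                 # favor retention
    return theta - lr * (alpha * g_f + beta * g_r)

def falcon_unlearn(theta0, grad_forget, grad_retain, steps=100, **kw):
    """Iterate updates on the single MI-selected layer; every other
    parameter of the model stays frozen."""
    theta = theta0.copy()
    for _ in range(steps):
        theta = falcon_unlearn_step(theta, grad_forget, grad_retain, **kw)
    return theta
```

In the real framework the gradients come from backpropagating the InfoNCE and alignment losses through the selected layer; here any pair of gradient functions over a parameter vector exercises the control flow.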

6. Empirical Evaluation and Observed Effects

AGO Loss has been benchmarked on the Zephyr-7B-Beta, Yi-6B-Chat, and Mistral-7B-Instruct LLMs. WMDP unlearning metrics (bio-score, cyber-score) drop substantially after AGO unlearning, while held-out accuracy (MMLU) and perplexity (PPL) degrade only marginally. Gradient conflict analysis confirms lower interference for MI-selected layers, and knowledge recovery attacks (e.g., Enhanced GCG) recover only a small fraction of the erased information. Ablative experiments show that omitting any AGO Loss component (contrastive unalignment, gradient projection, or principal-offset vectors) yields diminished unlearning or excessive utility loss [(Hu et al., 3 Feb 2025), Tables 2-3].

7. Significance and Conceptual Implications

AGO Loss establishes a layer-local, representation-guided approach to machine unlearning that explicitly addresses the antagonism between forgetting and retention. Activation-space MI guidance confines edits to minimally entangled regions, contrastive unalignment systematically expels unwanted knowledge, and the orthogonalization scheme operationalizes a local Pareto-front compromise. The empirical results from FALCON demonstrate that AGO Loss delivers principled, robust, and efficient machine unlearning, suggesting that representation-aware, gradient-level conflict resolution may have broader implications for multi-objective optimization within neural networks (Hu et al., 3 Feb 2025).

References (1)

  • Hu et al., “FALCON: Fine-grained Activation manipuLation by Contrastive Orthogonal uNalignment,” 3 Feb 2025.
