AGO Loss for Fine-Grained Unlearning
- Activation-Guided Orthogonal (AGO) Loss is a machine unlearning method that employs layer-local optimization to achieve targeted forgetting with minimal loss of retained performance.
- It uses mutual information and PCA to select the least-entangled layer and combines contrastive unalignment with retention alignment using orthogonal gradient projection.
- Empirical evaluations on LLMs show that AGO Loss effectively removes targeted sensitive knowledge while preserving key retained capabilities with minimal interference.
The Activation-Guided Orthogonal (AGO) Loss is a central mechanism for fine-grained machine unlearning introduced in the FALCON framework—Fine-grained Activation manipuLation by Contrastive Orthogonal uNalignment—for LLMs. Designed to address safety concerns arising from the inadvertent encoding of sensitive or harmful information, AGO Loss departs from prior coarse-grained approaches by operating at the resolution of individual network layers identified through mutual-information metrics. AGO Loss balances effective removal (unlearning) of targeted knowledge against strict preservation of essential retained knowledge by integrating representation-guided parameter selection, contrastive gradient signals, and a principled orthogonalization of conflicting objectives during optimization (Hu et al., 3 Feb 2025).
1. Formal Definition of AGO Loss
AGO Loss operates at a single, strategically selected layer $l^*$, parameterized by $\theta_{l^*}$. Rather than minimizing a naive weighted sum of two competing objectives—contrastive unalignment loss for the “forget” set ($\mathcal{L}_{\mathrm{unalign}}$) and retention alignment loss for the “retain” set ($\mathcal{L}_{\mathrm{retain}}$)—AGO Loss interleaves their opposing gradients using orthogonal projection to approximate Pareto-optimal model updates. The update direction for $\theta_{l^*}$ is given by:

$$
g = \alpha \,\operatorname{proj}_{\perp}\!\big(\nabla_{\theta_{l^*}} \mathcal{L}_{\mathrm{unalign}}\big) + \beta \,\nabla_{\theta_{l^*}} \mathcal{L}_{\mathrm{retain}}
$$

where $\operatorname{proj}_{\perp}(\cdot)$ projects the unlearning gradient onto the orthogonal complement of the retention gradient when their cosine similarity is negative, suppressing destructive interference [(Hu et al., 3 Feb 2025), Eqn. (1)]. Scalars $\alpha$ and $\beta$ weight unlearning and retention, respectively, and are dynamically adjusted to favor retention under high conflict.
2. Activation-Based Parameter Selection
AGO Loss is fundamentally layer-local. FALCON first identifies the “least-entangled” layer by computing mutual information (MI) between activation distributions of the forget and retain sets at each layer $l$. Explicitly, let $A_f^{(l)}$ and $A_r^{(l)}$ denote activations for the forget and retain sets. The entropies $H(A_f^{(l)})$ and $H(A_r^{(l)})$ and the joint entropy $H(A_f^{(l)}, A_r^{(l)})$ are estimated by kernel density estimation (KDE) with a Gaussian kernel after dimensionality reduction via PCA (retaining 95% of the variance). MI is calculated as:

$$
I\big(A_f^{(l)}; A_r^{(l)}\big) = H\big(A_f^{(l)}\big) + H\big(A_r^{(l)}\big) - H\big(A_f^{(l)}, A_r^{(l)}\big)
$$

The target layer $l^*$ for AGO Loss optimization is selected by minimizing $I$ across layers:

$$
l^* = \arg\min_{l}\, I\big(A_f^{(l)}; A_r^{(l)}\big)
$$
Empirically, early layers usually exhibit minimal MI, indicating greater disentanglement between forget and retain knowledge. This selective focus enables “surgical” parameter updates with reduced collateral utility degradation (Hu et al., 3 Feb 2025).
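As a concrete illustration, the selection step can be prototyped with scikit-learn. The sketch below is a minimal version under stated assumptions: `acts_f[l]` and `acts_r[l]` are hypothetical per-layer activation arrays with paired samples, and the resubstitution KDE entropy estimator with a fixed bandwidth is illustrative and may differ from FALCON's exact estimator.

```python
# Sketch of MI-guided layer selection. Assumes acts_f[l], acts_r[l] are paired
# (n_samples, d_l) activation arrays for layer l (hypothetical names).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity

def kde_entropy(x, bandwidth=1.0):
    """Resubstitution estimate H(X) ~ -E[log p(X)] with a Gaussian KDE."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(x)
    return -kde.score_samples(x).mean()

def layer_mi(a_f, a_r):
    """I(A_f; A_r) = H(A_f) + H(A_r) - H(A_f, A_r) on PCA-reduced activations."""
    a_f = PCA(n_components=0.95).fit_transform(a_f)   # keep 95% of variance
    a_r = PCA(n_components=0.95).fit_transform(a_r)
    joint = np.hstack([a_f, a_r])   # paired samples define the joint distribution
    return kde_entropy(a_f) + kde_entropy(a_r) - kde_entropy(joint)

def select_target_layer(acts_f, acts_r):
    """l* = argmin_l I(A_f^(l); A_r^(l)) over all layers."""
    scores = [layer_mi(acts_f[l], acts_r[l]) for l in range(len(acts_f))]
    return int(np.argmin(scores)), scores
```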
3. Contrastive Unalignment and Retention Terms
Contrastive Unalignment Loss ($\mathcal{L}_{\mathrm{unalign}}$)
At the identified layer $l^*$, AGO Loss employs a contrastive mechanism to enforce separation between updated forget-set activations and their original principal subspaces. For each batch:
- Obtain forget activations $A_f$ from the updated model and $\tilde{A}_f$ from a frozen reference model.
- Extract the top-$k$ principal directions $u_1, \dots, u_k$ via SVD of $\tilde{A}_f$ ($\tilde{A}_f = U \Sigma V^{\top}$; the first $k$ columns of $V$).
- Generate a “principal offset” vector $v_{\mathrm{off}}$ by pushing a random seed vector $v_0$ away from these subspaces:

$$
v_{\mathrm{off}} = \phi\!\Big( v_0 - \gamma \sum_{i=1}^{k} \langle v_0, u_i \rangle\, u_i + \epsilon \Big)
$$

where $\gamma$ controls the offset magnitude, $\epsilon$ is optional noise, and $\phi$ is a projection or nonlinearity.
- Compute cosine similarities between anchor (updated activation), positive (offset), and negatives (other frozen forget activations). The loss uses InfoNCE:

$$
\mathcal{L}_{\mathrm{unalign}} = -\log \frac{\exp\!\big(\operatorname{sim}(a, v_{\mathrm{off}})/\tau\big)}{\exp\!\big(\operatorname{sim}(a, v_{\mathrm{off}})/\tau\big) + \sum_{a^{-} \in \mathcal{N}} \exp\!\big(\operatorname{sim}(a, a^{-})/\tau\big)}
$$

where $\operatorname{sim}(\cdot, \cdot)$ denotes cosine similarity, $a$ is the anchor, $\mathcal{N}$ is the set of frozen negative activations, and the temperature $\tau$ is set empirically [(Hu et al., 3 Feb 2025), Eqn. (7)].
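A minimal PyTorch sketch of this term follows. The offset construction and the defaults for `k`, `gamma`, `noise_scale`, and `tau` are illustrative stand-ins consistent with the description above, not the paper's exact Eqn. (7); `a_upd` and `a_ref` are assumed batched forget-set activations from the updated and frozen models.

```python
# Sketch of the contrastive unalignment term. a_upd / a_ref: (B, d) forget-set
# activations from the updated and frozen models (assumed names and defaults).
import torch
import torch.nn.functional as F

def principal_offset(a_ref, k, gamma=1.0, noise_scale=0.01):
    """Push a random seed vector away from the top-k principal directions of the
    frozen forget activations (requires k <= min(B, d))."""
    _, _, vh = torch.linalg.svd(a_ref, full_matrices=False)
    u = vh[:k]                                          # (k, d) principal directions
    v0 = torch.randn(a_ref.size(1), dtype=a_ref.dtype, device=a_ref.device)
    v = v0 - gamma * (u @ v0) @ u                       # remove principal components
    v = v + noise_scale * torch.randn_like(v0)          # optional noise epsilon
    return F.normalize(v, dim=0)                        # phi: here, L2 normalization

def unalignment_loss(a_upd, a_ref, k=3, tau=0.1):
    """InfoNCE with anchor = updated forget activation, positive = principal
    offset, negatives = frozen forget activations in the batch."""
    p = principal_offset(a_ref, k)                      # (d,)
    a = F.normalize(a_upd, dim=-1)                      # (B, d) anchors
    neg = F.normalize(a_ref, dim=-1)                    # (B, d) negatives
    logits = torch.cat([(a @ p).unsqueeze(1), a @ neg.T], dim=1) / tau
    targets = torch.zeros(a.size(0), dtype=torch.long, device=a.device)
    return F.cross_entropy(logits, targets)             # positive sits in column 0
```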
Retention Alignment Loss ($\mathcal{L}_{\mathrm{retain}}$)
To preserve retained knowledge, AGO Loss imposes a cosine alignment constraint between updated and frozen retain-set activations:

$$
\mathcal{L}_{\mathrm{retain}} = \mathbb{E}_{x \sim \mathcal{D}_r}\!\left[\, 1 - \operatorname{sim}\!\big( A_r(x), \tilde{A}_r(x) \big) \right]
$$
This self-supervised retention term curtails drift on the retain set [(Hu et al., 3 Feb 2025), Eqn. (8)].
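In PyTorch, under the same batched-activation assumptions as the previous sketch, this term is essentially a one-liner:

```python
import torch.nn.functional as F

def retention_loss(a_upd_r, a_ref_r):
    """Mean (1 - cosine similarity) between updated and frozen retain activations."""
    return (1.0 - F.cosine_similarity(a_upd_r, a_ref_r, dim=-1)).mean()
```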
4. Orthogonal Projection and Gradient Conflict Resolution
The competing objectives inherent in machine unlearning—maximal forgetting while retaining existing utility—can manifest as conflicting gradients during joint optimization. AGO Loss explicitly identifies and mitigates such conflict:
- Gradients of $\mathcal{L}_{\mathrm{unalign}}$ ($g_u$) and $\mathcal{L}_{\mathrm{retain}}$ ($g_r$) are computed at $\theta_{l^*}$.
- The cosine similarity $\cos(g_u, g_r)$ flags potential conflict (negative value).
- If a conflict is detected ($\cos(g_u, g_r) < 0$), the unlearning gradient is projected onto the orthogonal complement of the retention gradient:

$$
g_u^{\perp} = g_u - \frac{\langle g_u, g_r \rangle}{\lVert g_r \rVert^2}\, g_r
$$

- The final update is a weighted sum:

$$
g = \alpha\, g_u^{\perp} + \beta\, g_r
$$

with $(\alpha, \beta)$ chosen to favor retention when $\cos(g_u, g_r) < 0$ and to weight the two objectives more evenly otherwise, prioritizing retention in the event of strong conflict [(Hu et al., 3 Feb 2025), Eqns. (10)-(11)].
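The conflict check, projection, and weighted combination reduce to a few operations on flattened gradient vectors. The sketch below uses an illustrative conflict-time re-weighting (`beta_conflict`) rather than the paper's exact $\alpha,\beta$ schedule.

```python
# Sketch of conflict detection, orthogonal projection, and the weighted update.
import torch

def ago_update_direction(g_u, g_r, alpha=1.0, beta=1.0, beta_conflict=2.0):
    """Combine unlearning gradient g_u and retention gradient g_r (flat 1-D)."""
    cos = torch.dot(g_u, g_r) / (g_u.norm() * g_r.norm() + 1e-12)
    if cos < 0:  # destructive interference: project g_u out of g_r's direction
        g_u = g_u - torch.dot(g_u, g_r) / g_r.norm().pow(2) * g_r
        beta = beta_conflict                 # shift weight toward retention
    return alpha * g_u + beta * g_r
```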
5. Integration in FALCON Framework and Training Procedure
AGO Loss is integrated within FALCON’s end-to-end unlearning regime as follows:
- For every layer $l$, the mutual information $I(A_f^{(l)}; A_r^{(l)})$ is computed via KDE and PCA on forget- and retain-set activations.
- The minimal-MI layer $l^*$ is selected; all other parameters are frozen.
- Updates are performed solely on $\theta_{l^*}$ using a second-order optimizer (e.g., Sophia).
- Each step: batch forget/retain examples, compute the respective activations, evaluate $\mathcal{L}_{\mathrm{unalign}}$ (via SVD and InfoNCE), evaluate $\mathcal{L}_{\mathrm{retain}}$, compute and project gradients as necessary, and apply the update (a sketch of this loop follows below).
- Hyperparameters include the loss weighting (tuned for question-answering tasks), the total number of update steps, the number of principal vectors $k$, and standard Sophia optimizer settings (Hu et al., 3 Feb 2025).
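To make the procedure concrete, the following sketch composes the helpers from the earlier sketches (`unalignment_loss`, `retention_loss`, `ago_update_direction`) into one layer-local unlearning loop. The `get_acts` hook helper and the AdamW stand-in for the Sophia optimizer are assumptions for illustration, not FALCON's implementation.

```python
# Integration sketch of the layer-local unlearning loop (assumed helper names).
import torch

def get_acts(model, batch, layer):
    """Capture the selected layer's output with a forward hook.
    (For HF transformer blocks the hook receives a tuple; index it accordingly.)"""
    store = {}
    handle = layer.register_forward_hook(lambda m, i, o: store.setdefault("a", o))
    model(batch)
    handle.remove()
    return store["a"]

def unlearn(model, ref_model, layer, ref_layer, forget_loader, retain_loader,
            steps=100, lr=1e-4):
    params = list(layer.parameters())
    for p in model.parameters():          # freeze everything ...
        p.requires_grad_(False)
    for p in params:                      # ... except the selected layer l*
        p.requires_grad_(True)
    # FALCON uses the second-order Sophia optimizer; AdamW is a stand-in here.
    opt = torch.optim.AdamW(params, lr=lr)
    flat = lambda grads: torch.cat([g.reshape(-1) for g in grads])

    for step, (xf, xr) in enumerate(zip(forget_loader, retain_loader)):
        if step >= steps:
            break
        with torch.no_grad():             # frozen reference activations
            a_f_ref = get_acts(ref_model, xf, ref_layer)
            a_r_ref = get_acts(ref_model, xr, ref_layer)
        g_u = torch.autograd.grad(
            unalignment_loss(get_acts(model, xf, layer), a_f_ref), params)
        g_r = torch.autograd.grad(
            retention_loss(get_acts(model, xr, layer), a_r_ref), params)
        g = ago_update_direction(flat(g_u), flat(g_r))
        offset = 0
        for p in params:                  # scatter combined direction into .grad
            n = p.numel()
            p.grad = g[offset:offset + n].view_as(p)
            offset += n
        opt.step()
        opt.zero_grad()
```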
6. Empirical Evaluation and Observed Effects
AGO Loss has been benchmarked on the Zephyr-7B-Beta, Yi-6B-Chat, and Mistral-7B-Instruct LLMs. WMDP unlearning metrics (bio-score and cyber-score) drop substantially after AGO across models, while held-out MMLU accuracy and perplexity (PPL) remain close to their pre-unlearning values. Gradient-conflict analysis confirms lower interference for MI-selected layers, and knowledge-recovery attacks (e.g., Enhanced GCG) recover only a small fraction of the erased information. Ablative experiments show that omitting any AGO Loss component—contrastive unalignment, gradient projection, or principal-offset vectors—yields diminished unlearning or excessive utility loss [(Hu et al., 3 Feb 2025), Tables 2-3].
7. Significance and Conceptual Implications
AGO Loss establishes a layer-local, representation-guided approach to machine unlearning that explicitly addresses the antagonism between forgetting and retention. Activation-space MI guidance confines edits to minimally entangled regions, contrastive unalignment systematically expels unwanted knowledge, and the orthogonalization scheme operationalizes a local Pareto-front compromise. The empirical results from FALCON demonstrate that AGO Loss delivers principled, robust, and efficient machine unlearning, suggesting that representation-aware, gradient-level conflict resolution may have broader implications for multi-objective optimization within neural networks (Hu et al., 3 Feb 2025).