
Visual-Guided Key-Token Regularization

Updated 5 February 2026
  • The paper introduces ViKeR, a method that uses visual guidance and token-level regularization to precisely forget sensitive content without degrading overall model performance.
  • ViKeR distinguishes between key and normal tokens through information entropy, ensuring retention of non-sensitive data while targeting privacy-critical tokens.
  • Experimental results on MLLMU and CLEAR benchmarks demonstrate superior retention of fluency and factuality compared to baseline methods.

Visual-Guided Key-Token Regularization (ViKeR) is a methodology for unlearning in multimodal LLMs (MLLMs) designed to ensure that sensitive information, particularly that associated with certain visual inputs, is effectively forgotten by the model without loss of general utility or fluency. Unlike prior approaches, ViKeR utilizes irrelevant visual cues to guide the regularization process at the token level, focusing forgetting pressure precisely on those answer tokens that are genuinely privacy-critical, as determined by their information entropy in the context of the unlearning objective (Cai et al., 29 Jan 2026).

1. Formulation of the MLLM Unlearning Problem

ViKeR is proposed in the context of auto-regressive MLLMs parameterized by $\theta$. Training examples are structured as triples $s = (I, x, y)$, where $I$ denotes an image, $x$ the question, and $y = [y_1, \dots, y_{|y|}]$ the ground-truth answer tokens. At generation step $i$, the token-level output distribution is $p_\theta(v \mid I, x, i)$. The standard negative log-likelihood loss over the (pre-unlearning) training dataset $D_{full}$ is

LNLL(Dfull;θ)=1Dfull(I,x,y)Dfulllogp(yI,x;θ)L_{NLL}(D_{full}; \theta) = -\frac{1}{|D_{full}|}\sum_{(I,x,y)\in D_{full}} \log p(y|I,x; \theta)

Unlearning is posed as updating the model parameters on a designated forget set $D_f \subset D_{full}$ (where $|D_f| \ll |D_{full}|$) so that the post-unlearning parameters $\theta^*$ achieve:

  1. Forgetting: for all $(I, x, y) \in D_f$, $\theta^*$ no longer predicts $y$;
  2. Retention: for all $(I, x, y) \notin D_f$, the behavior of $\theta^*$ is preserved;
  3. Coherence: for all inputs $(I, x)$, outputs remain fluent.
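These definitions can be made concrete with a small sketch of the NLL objective. The per-token probabilities below are toy values standing in for a real MLLM's per-step outputs; `nll_loss` and its interface are illustrative, not the authors' implementation:

```python
import numpy as np

def nll_loss(dataset_token_probs):
    """Mean negative log-likelihood L_NLL over a dataset.

    dataset_token_probs: one array per example (I, x, y), holding the model
    probabilities p_theta(y_i | I, x) of each ground-truth answer token.
    (Toy stand-in for a real MLLM's per-step output distributions.)
    """
    per_example_nll = [-np.log(p).sum() for p in dataset_token_probs]
    return float(np.mean(per_example_nll))

# A confidently memorized answer vs. an uncertain one: lower token
# probabilities produce a higher NLL.
memorized = np.array([0.9, 0.95, 0.99])
uncertain = np.array([0.5, 0.4, 0.6])
loss = nll_loss([memorized, uncertain])
```

Unlearning then means raising this loss on $D_f$ while leaving it unchanged elsewhere.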

2. Key Token Identification via Information Entropy

ViKeR addresses the unlearning challenge at token granularity by differentiating between key and normal tokens. The ideal post-unlearning token distribution at position $i$ in an answer $y$ is $R_i^y(v)$ (approximated via visual guidance). The entropy $H(q)$ of a token distribution $q$ over vocabulary $V$ is

H(q) = -\sum_{v \in V} q(v) \log q(v)

A token $y_i$ is defined as normal if $R_i^y(y_i) \to 1$, leading to $H(R_i^y) \to 0$. In contrast, $y_i$ is considered a key token if $H(R_i^y) \geq \epsilon$ for some $\epsilon > 0$, reflecting uncertainty in its ideal distribution, which is typically associated with identity-revealing or sensitive content.
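The entropy test can be illustrated directly. The 4-token vocabulary, the distributions, and the threshold value are hypothetical, chosen only to show how a peaked (normal) and a flat (key) ideal distribution separate:

```python
import numpy as np

def entropy(q, eps=1e-12):
    """Shannon entropy H(q) of a token distribution q over the vocabulary."""
    q = np.asarray(q, dtype=float)
    return float(-(q * np.log(q + eps)).sum())

def is_key_token(ideal_dist, epsilon):
    """A token is 'key' when its ideal distribution is high-entropy."""
    return entropy(ideal_dist) >= epsilon

# Normal token: ideal distribution stays near one-hot, so H is close to 0.
peaked = np.array([0.97, 0.01, 0.01, 0.01])
# Key token: ideal distribution is flat, so H approaches log|V|.
flat = np.array([0.25, 0.25, 0.25, 0.25])
```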

3. Visual-Guided Estimation of Ideal Token Distributions

To estimate $R_i^y(v)$, ViKeR uses $k$ irrelevant reference images $\mathcal{I}' = \{I'_j\}_{j=1}^k$ (e.g., images of random celebrities not in $D_f$). For each $(I, x, y) \in D_f$, the pre-unlearning MLLM produces per-token distributions $p_{\theta_{full}}(v \mid I'_j, x, i)$. The estimated ideal distribution is the average over the $k$ references:

\hat{R}_i^y(v) = \frac{1}{k} \sum_{j=1}^k p_{\theta_{full}}(v \mid I'_j, x, i)

Normal tokens retain peaked distributions at $y_i$; key tokens’ distributions flatten, expressing uncertainty and thus diminishing memorization.
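The averaging step and its flattening effect can be sketched on a toy 3-word vocabulary. The reference distributions below are hypothetical, not the paper's data:

```python
import numpy as np

def ideal_distribution(reference_dists):
    """Estimate R_hat_i^y(v) by averaging the pre-unlearning model's token
    distributions over k irrelevant reference images (shape: (k, |V|))."""
    return np.asarray(reference_dists, dtype=float).mean(axis=0)

# Identity token: each reference image pushes toward a different name,
# so the average flattens (high entropy -> key token).
identity_refs = [[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]]
flat = ideal_distribution(identity_refs)    # -> [1/3, 1/3, 1/3]

# Grammar-driven token: every reference agrees, so the peak survives
# (low entropy -> normal token).
grammar_refs = [[0.95, 0.03, 0.02]] * 3
peaked = ideal_distribution(grammar_refs)   # -> [0.95, 0.03, 0.02]
```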

4. Regularization Strategy and Loss Function

The ViKeR loss combines a negative log-likelihood gradient-ascent term (to enforce forgetting on $D_f$) with a KL-regularization term aligning the post-unlearning token distributions with their ideal estimates:

L_{ViKeR}(D_f; \theta) = -L_{NLL}(D_f; \theta) + \frac{\lambda}{|D_f|} \sum_{(I,x,y) \in D_f} \sum_{i=1}^{|y|} \mathrm{KL}(\hat{R}_i^y \,\|\, \hat{Q}_i^y)

where $\hat{Q}_i^y(v) = p_\theta(v \mid I, x, i)$ is the current model prediction, and $\lambda$ controls the forgetting-to-coherence trade-off.
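A minimal sketch of this objective for a single forget example, with toy probabilities rather than the authors' implementation:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over the vocabulary."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float((p * (np.log(p + eps) - np.log(q + eps))).sum())

def viker_loss(gt_token_probs, ideal_dists, model_dists, lam):
    """Toy ViKeR loss for one forget example: negated NLL (gradient
    ascent) plus lambda-weighted KL(R_hat || Q_hat) over positions."""
    ga_term = np.log(gt_token_probs).sum()  # equals -L_NLL on this example
    reg = sum(kl_divergence(r, q) for r, q in zip(ideal_dists, model_dists))
    return float(ga_term + lam * reg)
```

When the model's distributions already match the ideal estimates, the KL term vanishes and only the gradient-ascent term remains; any mismatch adds a positive penalty pulling the model toward the visually guided targets.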

5. Token-Level Gradient Reweighting

A distinguishing mechanism of ViKeR is its effect on the token-wise learning signal. The per-token gradient of the ViKeR loss with respect to $\theta$ satisfies

\nabla_\theta L_{ViKeR}(v; \theta, y, i) = \left[\mathbb{1}\{v = y_i\} - \lambda \hat{R}_i^y(v)\right] \cdot \nabla_\theta L_{GA}(v; \theta, y, i)

For normal tokens, $\hat{R}_i^y(y_i) \to 1$, so the scale $1 - \lambda$ reduces (or zeroes) the forgetting signal. For key tokens, $\hat{R}_i^y(y_i)$ is small, so the scale $1 - \lambda \hat{R}_i^y(y_i)$ is substantially larger, amplifying the forgetting gradient. This selective pressure effectively erases only sensitive content while maintaining general fluency and factuality.
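The resulting per-token multiplier can be checked numerically. The $\hat{R}$ values below are hypothetical illustrations of the two regimes:

```python
def forget_scale(v_is_ground_truth, lam, r_hat_v):
    """Multiplier [1{v = y_i} - lambda * R_hat_i^y(v)] on the GA gradient."""
    indicator = 1.0 if v_is_ground_truth else 0.0
    return indicator - lam * r_hat_v

# Normal token: the ideal distribution still concentrates on y_i
# (R_hat ~ 1), so the forgetting signal is almost cancelled.
normal = forget_scale(True, lam=1.0, r_hat_v=0.98)  # ~0.02

# Key token: the ideal distribution is flat (R_hat small), so the
# forgetting gradient survives at nearly full strength.
key = forget_scale(True, lam=1.0, r_hat_v=0.05)     # ~0.95
```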

6. Experimental Validation and Benchmarks

ViKeR’s efficacy was empirically demonstrated on the MLLMU and CLEAR benchmarks:

| Setting | Forgetting (ACC/REC ↓) | Retention (ROUGE/REC ↑) | Coherence (GIB ↑) |
|---|---|---|---|
| MLLMU-15% | ~32% ACC | +41.4% ROUGE, +21.1% BLEU | ~94.6% |
| CLEAR-10% | +0.48% REC loss (Forget) | +3.41% REC (Retain), matches top QA | Top-tier |
  • Base: LLaVA-7B with LoRA (rank = 8, $\alpha = 16$), vision encoder frozen.
  • Unlearning: AdamW (lr=5e-6), batch=2, single epoch.
  • Metrics: Multi-choice accuracy (forgetting), ROUGE-L/BLEU (content preservation), GIB (fluency).
  • Baselines: Gradient ascent (GA), Negative preference optimization (NPO), IdkPO.

These results show that ViKeR achieves competitive forgetting, with substantially higher content retention and output coherence relative to baselines. Visualization of token distributions confirms that ViKeR targets high-entropy (private) tokens for erasure, preserving other information.

7. Implementation Considerations and Ablations

Key hyperparameters include $\lambda$ (controlling regularization strength; e.g., $0.05$ for MLLMU-10%, $0.5$ for MLLMU-15%, $10$ for CLEAR) and the number of reference images $k = 5$, with performance stabilizing for $k \geq 5$.

Ablation studies indicate:

  1. Removing the regularizer reduces ViKeR to pure GA, resulting in total forgetting and incoherence.
  2. Omitting the GA term fails to achieve unlearning.
  3. Excluding visual guidance leads to poor retention.
  4. Substituting alternative regularizers (cosine similarity, JSD) yields inferior trade-offs.
  5. Using irrelevant people as references outperforms pets, scenes, or textures for the reference set.

ViKeR thus formulates multimodal model unlearning as token-level distribution alignment, regularized by visually guided ideal distributions, and achieves selective, entropy-based forgetting with efficient retention and coherence (Cai et al., 29 Jan 2026).
