
Visual-Guided Key-Token Regularization

Updated 5 February 2026
  • The paper introduces ViKeR, a method that uses visual guidance and token-level regularization to precisely forget sensitive content without degrading overall model performance.
  • ViKeR distinguishes between key and normal tokens through information entropy, ensuring retention of non-sensitive data while targeting privacy-critical tokens.
  • Experimental results on MLLMU and CLEAR benchmarks demonstrate superior retention of fluency and factuality compared to baseline methods.

Visual-Guided Key-Token Regularization (ViKeR) is a methodology for unlearning in multimodal LLMs (MLLMs) designed to ensure that sensitive information, particularly that associated with certain visual inputs, is effectively forgotten by the model without loss of general utility or fluency. Unlike prior approaches, ViKeR utilizes irrelevant visual cues to guide the regularization process at the token level, focusing forgetting pressure precisely on those answer tokens that are genuinely privacy-critical, as determined by their information entropy in the context of the unlearning objective (Cai et al., 29 Jan 2026).

1. Formulation of the MLLM Unlearning Problem

ViKeR is proposed in the context of auto-regressive MLLMs parameterized by $\theta$. Training examples are structured as triples $s = (I, x, y)$, where $I$ denotes an image, $x$ the question, and $y = [y_1, \dots, y_{|y|}]$ the ground-truth answer tokens. At generation step $i$, the token-level output distribution is $p_\theta(v \mid I, x, i)$. The standard negative log-likelihood loss over the (pre-unlearning) training dataset $D_{full}$ is

LNLL(Dfull;θ)=1Dfull(I,x,y)Dfulllogp(yI,x;θ)L_{NLL}(D_{full}; \theta) = -\frac{1}{|D_{full}|}\sum_{(I,x,y)\in D_{full}} \log p(y|I,x; \theta)

Unlearning is posed as updating the model parameters on a designated forget set $D_f \subset D_{full}$ (where $|D_f| \ll |D_{full}|$) so that the post-unlearning parameters $\theta^*$ achieve:

  1. Forgetting: for all $(I, x, y) \in D_f$, $\theta^*$ no longer predicts $y$;
  2. Retention: for all $(I, x, y) \notin D_f$, the behavior of $\theta^*$ is preserved;
  3. Coherence: for all inputs $(I, x)$, outputs remain fluent.
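These definitions can be made concrete with a small sketch of the NLL objective. The per-token probabilities below are toy values standing in for a real MLLM's per-step outputs; `nll_loss` and its interface are illustrative, not the authors' implementation:

```python
import numpy as np

def nll_loss(dataset_token_probs):
    """Mean negative log-likelihood L_NLL over a dataset.

    dataset_token_probs: one array per example (I, x, y), holding the model
    probabilities p_theta(y_i | I, x) of each ground-truth answer token.
    (Toy stand-in for a real MLLM's per-step output distributions.)
    """
    per_example_nll = [-np.log(p).sum() for p in dataset_token_probs]
    return float(np.mean(per_example_nll))

# A confidently memorized answer vs. an uncertain one: lower token
# probabilities produce a higher NLL.
memorized = np.array([0.9, 0.95, 0.99])
uncertain = np.array([0.5, 0.4, 0.6])
loss = nll_loss([memorized, uncertain])
```

Unlearning then means raising this loss on $D_f$ while leaving it unchanged elsewhere.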

2. Key Token Identification via Information Entropy

ViKeR addresses the unlearning challenge at token granularity by differentiating between key and normal tokens. The ideal post-unlearning token distribution at position $i$ in an answer $y$ is $R_i^y(v)$ (approximated via visual guidance). The entropy $H(q)$ of a token distribution $q$ over vocabulary $V$ is

H(q) = -\sum_{v \in V} q(v) \log q(v)

A token $y_i$ is defined as normal if $R_i^y(y_i) \to 1$, leading to $H(R_i^y) \to 0$. In contrast, $y_i$ is considered a key token if $H(R_i^y) \geq \epsilon$ for some $\epsilon > 0$, reflecting uncertainty in its ideal distribution, which is typically associated with identity-revealing or sensitive content.
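The entropy test can be illustrated directly. The 4-token vocabulary, the distributions, and the threshold value are hypothetical, chosen only to show how a peaked (normal) and a flat (key) ideal distribution separate:

```python
import numpy as np

def entropy(q, eps=1e-12):
    """Shannon entropy H(q) of a token distribution q over the vocabulary."""
    q = np.asarray(q, dtype=float)
    return float(-(q * np.log(q + eps)).sum())

def is_key_token(ideal_dist, epsilon):
    """A token is 'key' when its ideal distribution is high-entropy."""
    return entropy(ideal_dist) >= epsilon

# Normal token: ideal distribution stays near one-hot, so H is close to 0.
peaked = np.array([0.97, 0.01, 0.01, 0.01])
# Key token: ideal distribution is flat, so H approaches log|V|.
flat = np.array([0.25, 0.25, 0.25, 0.25])
```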

3. Visual-Guided Estimation of Ideal Token Distributions

To estimate $R_i^y(v)$, ViKeR uses $k$ irrelevant reference images $\mathcal{I}' = \{I'_j\}_{j=1}^k$ (e.g., images of random celebrities not in $D_f$). For each $(I, x, y) \in D_f$, the pre-unlearning MLLM produces per-token distributions $p_{\theta_{full}}(v \mid I'_j, x, i)$. The estimated ideal distribution is the average over the $k$ references:

\hat{R}_i^y(v) = \frac{1}{k} \sum_{j=1}^k p_{\theta_{full}}(v \mid I'_j, x, i)

Normal tokens retain peaked distributions at $y_i$; key tokens’ distributions flatten, expressing uncertainty and thus diminishing memorization.
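The averaging step and its flattening effect can be sketched on a toy 3-word vocabulary. The reference distributions below are hypothetical, not the paper's data:

```python
import numpy as np

def ideal_distribution(reference_dists):
    """Estimate R_hat_i^y(v) by averaging the pre-unlearning model's token
    distributions over k irrelevant reference images (shape: (k, |V|))."""
    return np.asarray(reference_dists, dtype=float).mean(axis=0)

# Identity token: each reference image pushes toward a different name,
# so the average flattens (high entropy -> key token).
identity_refs = [[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]]
flat = ideal_distribution(identity_refs)    # -> [1/3, 1/3, 1/3]

# Grammar-driven token: every reference agrees, so the peak survives
# (low entropy -> normal token).
grammar_refs = [[0.95, 0.03, 0.02]] * 3
peaked = ideal_distribution(grammar_refs)   # -> [0.95, 0.03, 0.02]
```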

4. Regularization Strategy and Loss Function

The ViKeR loss combines a negative log-likelihood gradient-ascent term (to enforce forgetting on $D_f$) with a KL-regularization term aligning the post-unlearning token distributions with their ideal estimates:

L_{ViKeR}(D_f; \theta) = -L_{NLL}(D_f; \theta) + \frac{\lambda}{|D_f|} \sum_{(I,x,y) \in D_f} \sum_{i=1}^{|y|} \mathrm{KL}(\hat{R}_i^y \,\|\, \hat{Q}_i^y)

where $\hat{Q}_i^y(v) = p_\theta(v \mid I, x, i)$ is the current model prediction, and $\lambda$ controls the forgetting-to-coherence trade-off.
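A minimal sketch of this objective for a single forget example, with toy probabilities rather than the authors' implementation:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over the vocabulary."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float((p * (np.log(p + eps) - np.log(q + eps))).sum())

def viker_loss(gt_token_probs, ideal_dists, model_dists, lam):
    """Toy ViKeR loss for one forget example: negated NLL (gradient
    ascent) plus lambda-weighted KL(R_hat || Q_hat) over positions."""
    ga_term = np.log(gt_token_probs).sum()  # equals -L_NLL on this example
    reg = sum(kl_divergence(r, q) for r, q in zip(ideal_dists, model_dists))
    return float(ga_term + lam * reg)
```

When the model's distributions already match the ideal estimates, the KL term vanishes and only the gradient-ascent term remains; any mismatch adds a positive penalty pulling the model toward the visually guided targets.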

5. Token-Level Gradient Reweighting

A distinguishing mechanism of ViKeR is its effect on the token-wise learning signal. The per-token gradient of the ViKeR loss with respect to $\theta$ satisfies

\nabla_\theta L_{ViKeR}(v; \theta, y, i) = \left[\mathbb{1}\{v = y_i\} - \lambda \hat{R}_i^y(v)\right] \cdot \nabla_\theta L_{GA}(v; \theta, y, i)

For normal tokens, $\hat{R}_i^y(y_i) \to 1$, so the scale $1 - \lambda$ reduces (or zeroes) the forgetting signal. For key tokens, $\hat{R}_i^y(y_i)$ is small, so the scale $1 - \lambda \hat{R}_i^y(y_i)$ is substantially larger, amplifying the forgetting gradient. This selective pressure effectively erases only sensitive content while maintaining general fluency and factuality.
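The resulting per-token multiplier can be checked numerically. The $\hat{R}$ values below are hypothetical illustrations of the two regimes:

```python
def forget_scale(v_is_ground_truth, lam, r_hat_v):
    """Multiplier [1{v = y_i} - lambda * R_hat_i^y(v)] on the GA gradient."""
    indicator = 1.0 if v_is_ground_truth else 0.0
    return indicator - lam * r_hat_v

# Normal token: the ideal distribution still concentrates on y_i
# (R_hat ~ 1), so the forgetting signal is almost cancelled.
normal = forget_scale(True, lam=1.0, r_hat_v=0.98)  # ~0.02

# Key token: the ideal distribution is flat (R_hat small), so the
# forgetting gradient survives at nearly full strength.
key = forget_scale(True, lam=1.0, r_hat_v=0.05)     # ~0.95
```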

6. Experimental Validation and Benchmarks

ViKeR’s efficacy was empirically demonstrated on the MLLMU and CLEAR benchmarks:

| Setting | Forgetting (ACC/REC ↓) | Retention (ROUGE/REC ↑) | Coherence (GIB ↑) |
|---|---|---|---|
| MLLMU-15% | ~32% ACC | +41.4% ROUGE, +21.1% BLEU | ~94.6% |
| CLEAR-10% | +0.48% REC loss (Forget) | +3.41% REC (Retain), matches top QA | Top-tier |
  • Base: LLaVA-7B with LoRA (rank = 8, $\alpha = 16$), vision encoder frozen.
  • Unlearning: AdamW (lr=5e-6), batch=2, single epoch.
  • Metrics: Multi-choice accuracy (forgetting), ROUGE-L/BLEU (content preservation), GIB (fluency).
  • Baselines: Gradient ascent (GA), Negative preference optimization (NPO), IdkPO.

These results show that ViKeR achieves competitive forgetting, with substantially higher content retention and output coherence relative to baselines. Visualization of token distributions confirms that ViKeR targets high-entropy (private) tokens for erasure, preserving other information.

7. Implementation Considerations and Ablations

Key hyperparameters include $\lambda$ (controlling regularization strength; e.g., $0.05$ for MLLMU-10%, $0.5$ for MLLMU-15%, $10$ for CLEAR) and the number of reference images $k = 5$, with performance stabilizing for $k \geq 5$.

Ablation studies indicate:

  1. Removing the regularizer reduces ViKeR to pure GA, resulting in total forgetting and incoherence.
  2. Omitting the GA term fails to achieve unlearning.
  3. Excluding visual guidance leads to poor retention.
  4. Substituting alternative regularizers (cosine similarity, JSD) yields inferior trade-offs.
  5. Using irrelevant people as references outperforms pets, scenes, or textures for the reference set.

ViKeR thus formulates multimodal model unlearning as token-level distribution alignment, regularized by visually guided ideal distributions, and achieves selective, entropy-based forgetting with efficient retention and coherence (Cai et al., 29 Jan 2026).
