
Cross-Modal Attention Guided Unlearning

Updated 10 October 2025
  • Cross-Modal Attention Guided Unlearning is a paradigm that selectively removes sensitive details by transforming low-attention visual tokens in multi-modal models.
  • It utilizes a discriminator, cross-modal attention-guided selection, and a visual token encoder to target and erase undesirable memory traces without affecting primary outputs.
  • CAGUL preserves overall model performance and reduces computational overhead compared to full finetuning, offering a scalable solution for privacy compliance.

Cross-Modal Attention Guided Unlearning (CAGUL) is a paradigm for safely and efficiently “forgetting” or removing sensitive information from vision-language models (VLMs) and other multi-modal architectures. It leverages cross-modal attention to selectively target and transform those components—especially visual tokens—that contribute least to the model’s primary output, thereby erasing undesirable memory traces without retraining or altering core parameters. This approach addresses privacy risks and compliance challenges associated with large-scale multi-modal models that may inadvertently memorize or regurgitate private visual or textual content.

1. Rationale and Background

Traditional machine unlearning approaches, such as model finetuning or retraining from scratch, are inefficient and can degrade desirable model behavior. In VLMs—where multimodal association is encoded via attention between text and image representations—direct parameter updates incur substantial cost. CAGUL introduces an external strategy based on cross-modal attention matrices, informed by empirical findings that low-attention visual tokens are less critical to output generation yet may encode sensitive or undesirable details (Bhaila et al., 8 Oct 2025).

By focusing on tokens with the lowest cross-modal attention scores (from text queries to visual tokens), CAGUL enables targeted unlearning, removing sensitive traces without disrupting core model capabilities. This selective intervention contrasts with global retraining or finetuning, offering a lightweight and modular solution.

2. Technical Mechanism

CAGUL is implemented in three principal modules:

  • Discriminator Module ($C_\phi$): Determines whether a given image/query pair belongs to the “forget” set (i.e., contains private or sensitive information).
  • Cross-Modal Attention-Guided Selection: Computes the cross-modal attention matrix between textual queries and visual tokens:

$$A = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)$$

where $Q = Z_q W_q$ (text queries), $K = Z_v W_k$ (visual keys), and $d$ is the embedding dimension. Averaging over query tokens yields the per-token attention vector $\alpha \in \mathbb{R}^{1 \times n_v}$ over the visual tokens.

  • Visual Token Encoder ($F_\psi$): Applies a learned linear transformation to the $k$ lowest-attention visual tokens:

$$\tilde{Z}_{v,i} = \begin{cases} F_{\psi}(Z_{v,i}) & \text{if } i \in K \\ Z_{v,i} & \text{otherwise} \end{cases}$$

The modified visual tokens $\tilde{Z}_v$ and the original textual tokens $Z_q$ are then passed into the LLM for output generation.
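A minimal sketch of this attention-guided selection and transformation is given below. The single-head attention, module names, and tensor shapes are illustrative assumptions; the paper’s exact projections and dimensions may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttentionUnlearning(nn.Module):
    """Sketch of CAGUL's cross-modal attention-guided token transformation."""

    def __init__(self, d: int, k: int):
        super().__init__()
        self.w_q = nn.Linear(d, d, bias=False)   # text-query projection W_q
        self.w_k = nn.Linear(d, d, bias=False)   # visual-key projection W_k
        self.encoder = nn.Linear(d, d)           # visual token encoder F_psi
        self.k = k                               # number of tokens to transform

    def forward(self, z_q: torch.Tensor, z_v: torch.Tensor) -> torch.Tensor:
        # z_q: (n_q, d) textual query tokens; z_v: (n_v, d) visual tokens
        d = z_q.size(-1)
        attn = F.softmax(self.w_q(z_q) @ self.w_k(z_v).T / d ** 0.5, dim=-1)  # (n_q, n_v)
        alpha = attn.mean(dim=0)                                   # per-visual-token attention
        low_idx = torch.topk(alpha, self.k, largest=False).indices  # k least-attended tokens
        z_v_tilde = z_v.clone()
        z_v_tilde[low_idx] = self.encoder(z_v[low_idx])            # transform only those tokens
        return z_v_tilde   # passed, together with z_q, into the frozen LLM
```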

The combined objective function is

$$\mathcal{L} = \mathcal{L}_{\mathrm{bce}} + \mathcal{L}_f + \mathcal{L}_r$$

where $\mathcal{L}_{\mathrm{bce}}$ trains the discriminator, $\mathcal{L}_f$ enforces forgetting through preference optimization (steering towards refusals or safety responses for sensitive queries), and $\mathcal{L}_r$ preserves utility on non-sensitive (retain-set) queries.
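The decomposition of the objective can be sketched as follows. The forget term is approximated here by the likelihood of a refusal/safety response, a simple stand-in for the paper’s preference-optimization loss, so the exact formulation differs from CAGUL’s.

```python
import torch
import torch.nn.functional as F

def cagul_loss(disc_logits, is_forget, forget_logits, refusal_ids,
               retain_logits, retain_ids):
    """Illustrative decomposition of L = L_bce + L_f + L_r."""
    # L_bce: discriminator classifies (image, query) pairs as forget vs. retain.
    l_bce = F.binary_cross_entropy_with_logits(disc_logits, is_forget.float())
    # L_f: steer forget-set generations toward refusal/safety responses
    # (a likelihood surrogate for the paper's preference optimization).
    l_f = F.cross_entropy(forget_logits.view(-1, forget_logits.size(-1)),
                          refusal_ids.view(-1))
    # L_r: preserve utility on non-sensitive (retain-set) queries.
    l_r = F.cross_entropy(retain_logits.view(-1, retain_logits.size(-1)),
                          retain_ids.view(-1))
    return l_bce + l_f + l_r
```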

3. Role and Selection of Visual Tokens

Visual tokens produced by the vision transformer encapsulate fine-grained regions of the image. Cross-modal attention analysis reveals that only a minority of tokens—those most attended by the text—are crucial for output generation in tasks such as visual question answering. CAGUL exploits this by transforming the tokens with minimal text attention, which typically encode secondary or less relevant visual details.

Modifying these low-importance tokens (rather than high-attention ones) allows sensitive content to be erased from the model’s latent memory with minimal impact on generalization or utility, as confirmed by performance metrics on both “forget” and “retain” queries (Bhaila et al., 8 Oct 2025). This suggests that privacy risk mitigation can be accomplished through targeted token-level interventions guided by cross-modal attention.

4. Efficiency and Comparison to Finetuning-based Methods

A key advantage of CAGUL is efficiency. Instead of updating large-scale model weights—common in finetuning or retraining—CAGUL employs compact, external modules (discriminator, token encoder) to intervene on the input space. The base VLM remains frozen, drastically reducing the number of trainable parameters and the overall computational overhead.
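In practice, this means the base VLM’s weights are never updated and only the small external modules receive gradients. A minimal sketch, assuming generic PyTorch modules with illustrative names:

```python
import torch.nn as nn

def prepare_cagul_modules(vlm: nn.Module, discriminator: nn.Module,
                          token_encoder: nn.Module):
    """Freeze the base VLM; only the external CAGUL modules are trained."""
    for p in vlm.parameters():
        p.requires_grad = False                       # base model stays frozen
    trainable = sum(p.numel() for m in (discriminator, token_encoder)
                    for p in m.parameters())
    frozen = sum(p.numel() for p in vlm.parameters())
    print(f"trainable: {trainable:,} vs. frozen base: {frozen:,}")
    # Return only the parameters that the optimizer should update.
    return [p for m in (discriminator, token_encoder) for p in m.parameters()]
```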

Experimental results across multiple architectures and the FIUBench dataset show that CAGUL either matches or exceeds the effectiveness of finetuning-based baselines in terms of:

  • Lowering privacy/adversarial scores and leakage metrics on sensitive (“forget”) data
  • Preserving or improving utility and accuracy on downstream tasks and non-sensitive (“retain”) data
  • Dramatically reducing training time and memory requirements

A plausible implication is that CAGUL provides scalable privacy compliance for real-world deployment, circumventing the impracticality of retraining in high-resource scenarios.

5. Empirical Outcomes

On representative VLM architectures (e.g., LLaMA-3.2-11B-Vision-Instruct, Qwen-2.5-VL-7B-Instruct) and FIUBench, CAGUL demonstrates:

  • Pronounced reduction of privacy leakage metrics (ROUGE, Exact Match, adversarial benchmarks) on the forget set
  • Stable or enhanced utility scores and accuracy on retain sets and general-purpose evaluation benchmarks (MME, POPE)
  • No observable degradation of general multimodal reasoning capability

Efficiency measurements indicate substantially fewer trainable parameters and lower training times versus full model editing approaches. Ablation studies further support the selection of token-level interventions guided by cross-modal attention, rather than random or uniformly distributed modifications.
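For reference, the forget-set leakage metrics listed above can be computed roughly as follows; this uses the `rouge_score` package and a simple string-match definition of Exact Match, which may differ in detail from FIUBench’s evaluation scripts.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def leakage_metrics(predictions, references):
    """ROUGE-L and Exact Match over forget-set answers; lower values after
    unlearning indicate less leakage of the original sensitive content."""
    rouge = sum(scorer.score(ref, pred)["rougeL"].fmeasure
                for pred, ref in zip(predictions, references)) / len(predictions)
    exact = sum(pred.strip() == ref.strip()
                for pred, ref in zip(predictions, references)) / len(predictions)
    return {"rougeL": rouge, "exact_match": exact}
```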

6. Extensions and Future Directions

Challenges remain for broader adoption and further improvement of CAGUL:

  • Optimal Token Selection: Refinement of $k$ selection (the number of visual tokens to transform) may benefit from adaptive or dynamic strategies according to query context or attention distribution (see the threshold-based sketch after this list).
  • Modal Generalization: Extension to other modalities, such as audio or structured data, may require integrating analogous cross-modal attention mechanisms.
  • Loss Design: Alternative objective formulations may provide better trade-offs between privacy reduction and utility retention, potentially supporting finer-grained regulatory compliance.
  • Integration with Textual Unlearning: Combining CAGUL with textual-domain unlearning (as studied in cross-modal safety alignment (Chakraborty et al., 27 May 2024)) could yield unified frameworks for privacy across all modalities.
  • Real-world Benchmarks: Expansion to diverse datasets and compliance scenarios (e.g., GDPR, CCPA) is necessary to validate the robustness and scalability of the approach.
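One possible adaptive strategy, sketched here purely as an illustration rather than part of CAGUL itself, is to replace the fixed $k$ with a threshold on the attention distribution, so the number of transformed tokens varies with the query:

```python
import torch

def adaptive_low_attention_mask(alpha: torch.Tensor, ratio: float = 0.5) -> torch.Tensor:
    """Boolean mask over visual tokens whose mean cross-modal attention falls
    below `ratio` times the average attention, so the number of transformed
    tokens adapts to the attention distribution instead of using a fixed k.
    `ratio` is an illustrative hyperparameter, not from the paper."""
    threshold = ratio * alpha.mean(dim=-1, keepdim=True)
    return alpha < threshold
```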

A plausible implication is that CAGUL’s modular design and cross-modal guidance are well-suited for principled privacy interventions in multi-modal systems, including future generations of VLMs and multi-modal LLMs.


In summary, Cross-Modal Attention Guided Unlearning introduces an efficient and principled strategy for sensitive data forgetting in vision-language models by transforming visual tokens that receive low attention from the text query. It achieves effective privacy removal while preserving overall model utility and incurs minimal computational overhead compared to traditional model finetuning, marking a practical advance for multi-modal model safety and compliance (Bhaila et al., 8 Oct 2025).
