Overview of "Backdoor Cleaning without External Guidance in MLLM Fine-tuning"
The paper "Backdoor Cleaning without External Guidance in MLLM Fine-tuning" introduces a novel framework called Believe Your Eyes (BYE) aimed at addressing backdoor vulnerabilities in Multimodal LLMs (MLLMs) fine-tuned through the Fine-tuning-as-a-Service (FTaaS) paradigm. This research observes that backdoor triggers systematically cause disruptions in cross-modal processing, which manifest as abnormal attention behavior—a phenomenon the authors term "attention collapse". The BYE framework leverages these attention patterns as self-supervised signals to identify and filter out backdoor samples from the dataset used for model fine-tuning.
Key Components and Methodology
The core of BYE is a three-stage pipeline:
- Attention Map Extraction: BYE first extracts attention maps from the fine-tuned MLLM, analyzing how attention is allocated across image tokens by the decoder tokens that initiate answer generation.
- Entropy Calculation and Profiling: The entropy of each attention map is computed to measure its dispersion; poisoned samples exhibit lower entropy because attention concentrates on non-semantic trigger regions. BYE then profiles layers for bimodal separation, selecting those where attention allocation diverges most between clean and poisoned samples (see the entropy sketch after this list).
- Clustering and Filtering: Using a Gaussian Mixture Model (GMM), BYE clusters samples by their attention-entropy profiles and removes the suspicious cluster indicative of backdoor influence (a filtering sketch follows below). This stage requires no clean-data supervision or auxiliary inputs, which distinguishes BYE from prior defenses that often depend on additional validation datasets or model modifications.
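The entropy-based profiling in the first two stages can be illustrated with a short sketch. The dispersion measure is the Shannon entropy H(a) = -sum_i a_i log a_i of the normalized attention weights that the answer-initiating token assigns to image tokens. The function and variable names below (e.g., `attention_entropy`, `entropy_profile`) are illustrative, not taken from the paper, and assume per-layer attention weights have already been extracted from the model:

```python
import numpy as np

def attention_entropy(attn_weights: np.ndarray, eps: float = 1e-12) -> float:
    """Shannon entropy of an attention distribution over image tokens.

    attn_weights: 1-D array of attention weights from the answer-initiating
    decoder token to the image tokens at one layer (illustrative shape).
    Lower entropy means attention is concentrated on a few tokens,
    e.g., a small trigger patch.
    """
    p = attn_weights / (attn_weights.sum() + eps)   # normalize to a distribution
    return float(-(p * np.log(p + eps)).sum())

def entropy_profile(per_layer_attn: list[np.ndarray]) -> np.ndarray:
    """Per-layer entropy profile for a single training sample."""
    return np.array([attention_entropy(a) for a in per_layer_attn])
```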
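The filtering stage can then be approximated by fitting a two-component GMM over the per-sample entropy profiles and treating the lower-entropy component as suspicious. This is a minimal sketch using scikit-learn under that assumption, not the authors' exact implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def filter_suspicious(entropy_profiles: np.ndarray) -> np.ndarray:
    """Flag likely poisoned samples from per-sample entropy profiles.

    entropy_profiles: array of shape (num_samples, num_layers).
    Returns a boolean mask: True = keep (clean), False = drop (suspicious).
    """
    # Fit a 2-component GMM; one component should capture the low-entropy
    # (attention-collapsed) samples if poisoning is present.
    gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
    labels = gmm.fit_predict(entropy_profiles)

    # The component with the lower mean entropy is treated as suspicious.
    suspicious_component = int(np.argmin(gmm.means_.mean(axis=1)))
    return labels != suspicious_component

# Usage: keep only samples flagged as clean before fine-tuning.
# clean_mask = filter_suspicious(profiles)
# fine_tune_dataset = [s for s, keep in zip(dataset, clean_mask) if keep]
```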
Experimental Results
The authors validate BYE on multiple MLLMs, including LLaVA-v1.5 and InternVL2.5, across diverse tasks such as visual question answering and image captioning. BYE achieves near-zero attack success rates (ASR) while preserving performance on clean evaluation data. Specifically, on the ScienceQA and IconQA benchmarks, the ASR is reduced to 0.05% and 0.00%, respectively, demonstrating the framework's effectiveness in mitigating backdoor threats.
Implications and Future Directions
The implications of this research are notable both for the practical security of deployed MLLMs and for the theoretical understanding of how attention mechanisms can be leveraged for self-diagnosis. Because BYE is self-supervised and clean-reference-free, it avoids significant limitations of previous methods that rely on external guidance or assumptions about trigger patterns.
Looking forward, the development of frameworks such as BYE suggests several avenues for further exploration:
- Integration with Ongoing Training: Extending BYE from a preprocessing step to online filtering during training could enhance its utility in dynamic environments.
- Extension to Other Modality Combinations: Exploring the applicability of attention-driven purification in settings beyond visual and language modalities, such as audio-visual models, is an exciting direction for future research.
- Adaptive Attack Resistance: Continued assessment under varied and evolving attack strategies will be essential to refine entropy-based detection mechanisms in increasingly complex threat environments.
In conclusion, the BYE framework advances the state of research on secure MLLM adaptation, providing insights and tools that are both innovative and practically applicable for securing AI systems in diverse domains.