Papers
Topics
Authors
Recent
2000 character limit reached

M4-BLIP: Advancing Multi-Modal Media Manipulation Detection through Face-Enhanced Local Analysis (2512.01214v1)

Published 1 Dec 2025 in cs.CV and cs.AI

Abstract: In the contemporary digital landscape, multi-modal media manipulation has emerged as a significant societal threat, impacting the reliability and integrity of information dissemination. Current detection methodologies in this domain often overlook the crucial aspect of localized information, despite the fact that manipulations frequently occur in specific areas, particularly in facial regions. In response to this critical observation, we propose the M4-BLIP framework. This innovative framework utilizes the BLIP-2 model, renowned for its ability to extract local features, as the cornerstone for feature extraction. Complementing this, we incorporate local facial information as prior knowledge. A specially designed alignment and fusion module within M4-BLIP meticulously integrates these local and global features, creating a harmonious blend that enhances detection accuracy. Furthermore, our approach seamlessly integrates with LLMs (LLM), significantly improving the interpretability of the detection outcomes. Extensive quantitative and visualization experiments validate the effectiveness of our framework against the state-of-the-art competitors.

Summary

We haven't generated a summary for this paper yet.

Whiteboard

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.