
Multi-attentional Deepfake Detection (2103.02406v3)

Published 3 Mar 2021 in cs.CV

Abstract: Face forgery by deepfake is widely spread over the internet and has raised severe societal concerns. Recently, how to detect such forgery contents has become a hot research topic and many deepfake detection methods have been proposed. Most of them model deepfake detection as a vanilla binary classification problem, i.e, first use a backbone network to extract a global feature and then feed it into a binary classifier (real/fake). But since the difference between the real and fake images in this task is often subtle and local, we argue this vanilla solution is not optimal. In this paper, we instead formulate deepfake detection as a fine-grained classification problem and propose a new multi-attentional deepfake detection network. Specifically, it consists of three key components: 1) multiple spatial attention heads to make the network attend to different local parts; 2) textural feature enhancement block to zoom in the subtle artifacts in shallow features; 3) aggregate the low-level textural feature and high-level semantic features guided by the attention maps. Moreover, to address the learning difficulty of this network, we further introduce a new regional independence loss and an attention guided data augmentation strategy. Through extensive experiments on different datasets, we demonstrate the superiority of our method over the vanilla binary classifier counterparts, and achieve state-of-the-art performance.

Authors (6)
  1. Hanqing Zhao (27 papers)
  2. Wenbo Zhou (35 papers)
  3. Dongdong Chen (164 papers)
  4. Tianyi Wei (19 papers)
  5. Weiming Zhang (135 papers)
  6. Nenghai Yu (173 papers)
Citations (522)

Summary

Multi-attentional Deepfake Detection

The paper presents an innovative approach to deepfake detection by reframing it as a fine-grained classification problem rather than the conventional binary classification task. This shift in perspective is based on the observation that the differences between real and fake images are often subtle and localized, requiring a more nuanced method to capture these discrepancies effectively.

Methodology

The authors propose a multi-attentional deepfake detection network comprising three main components:

  1. Multiple Spatial Attention Heads: These heads enable the network to focus on different local regions of the image. By predicting multiple spatial attention maps, the system can differentiate artifacts distributed across various parts of a face, improving detection capabilities.
  2. Textural Feature Enhancement Block: This component enhances subtle artifacts by focusing on shallow features, which are rich in high-frequency information potentially indicating manipulation. The enhancement block utilizes densely connected convolutional layers to ensure these differences are accentuated rather than lost in deeper layers.
  3. Aggregation of Textural and Semantic Features: Guided by attention maps, the method aggregates both low-level textural features and high-level semantic features, capturing both detailed local variations and overall facial structure.
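The three components above can be sketched numerically. The following numpy example is an illustrative simplification, not the authors' implementation: the shapes, the high-frequency "texture" stand-in, and the pooling step are all hypothetical, and the paper uses learned convolutional blocks where this sketch uses fixed operations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature maps from a backbone (channels, height, width).
shallow = rng.standard_normal((32, 16, 16))   # low-level textural features
deep = rng.standard_normal((128, 16, 16))     # high-level semantic features
M = 4                                         # number of attention heads

# 1) Multiple spatial attention maps, one per head
#    (softmax over spatial locations so each map sums to 1).
logits = rng.standard_normal((M, 16, 16)).reshape(M, -1)
attn = np.exp(logits - logits.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)
attn = attn.reshape(M, 16, 16)

# 2) Texture-enhancement stand-in: subtract an average to keep
#    high-frequency residuals (the paper uses a densely connected
#    convolutional block instead of this crude mean subtraction).
texture = shallow - shallow.mean(axis=(1, 2), keepdims=True)

# 3) Attention-guided aggregation: each head pools both textural and
#    semantic features over its attended region; heads are concatenated.
tex_pooled = np.einsum('mhw,chw->mc', attn, texture)   # (M, 32)
sem_pooled = np.einsum('mhw,chw->mc', attn, deep)      # (M, 128)
feature = np.concatenate([tex_pooled.ravel(), sem_pooled.ravel()])

print(feature.shape)  # (M*32 + M*128,) = (640,)
```

The resulting vector would feed a final classifier; the key point the sketch conveys is that each attention head contributes its own pooled local evidence rather than one global feature.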

In addition to the novel network architecture, the authors address the challenges of learning in this multi-attentional context by introducing a regional independence loss and an attention-guided data augmentation strategy. These innovations ensure that multiple attention heads can function independently, thus covering diverse regions of interest.
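To illustrate what "function independently" means, here is a toy stand-in for a regional independence loss. The exact formulation in the paper differs; this hypothetical penalty only conveys the idea of discouraging attention heads from overlapping on the same region.

```python
import numpy as np

def region_overlap_penalty(attn):
    """Illustrative overlap penalty (not the paper's exact loss):
    measures average pairwise cosine similarity between attention maps,
    so identical maps score high and disjoint maps score zero."""
    M = attn.shape[0]
    flat = attn.reshape(M, -1)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = flat @ flat.T                 # (M, M) pairwise similarities
    off_diag = sim - np.eye(M)          # ignore self-similarity
    return np.abs(off_diag).sum() / (M * (M - 1))

# Identical maps overlap maximally; disjoint one-hot maps not at all.
same = np.ones((2, 4, 4))
disjoint = np.stack([np.eye(4), np.fliplr(np.eye(4))])
print(region_overlap_penalty(same), region_overlap_penalty(disjoint))
```

Minimizing such a term pushes the heads toward distinct facial regions, which is what lets the attention-guided augmentation expose each head to different local artifacts during training.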

Experimental Results

Extensive experimentation on datasets such as FaceForensics++, Celeb-DF, and DFDC demonstrated the superiority of this approach over existing binary classifier-based methods, achieving state-of-the-art performance. Notably, the method consistently outperformed baselines on the high-quality versions of the datasets, though it showed sensitivity to the heavy compression in the low-quality versions.

Implications and Future Directions

The finer granularity in detection enabled by this multi-attentional approach has significant implications for both the theoretical understanding and practical handling of deepfake detection. By aligning the problem more closely with fine-grained classification frameworks, the method opens new avenues for leveraging advances in this area, potentially inspiring future research to explore similar redefinitions in other domains.

Future developments might focus on enhancing robustness to compression artifacts and exploring further refinements in attention mechanisms to better capture and interpret nuanced local features. Additionally, expanding this approach to broader types of media beyond facial images could be a promising direction.

The paper presents a well-founded case for rethinking how deepfakes are detected, providing a clear path forward with robust empirical support and offering a platform for subsequent explorations into the nuances of face forgery detection.