Analyzing Video Face Manipulation Detection Using CNN Ensembles
In digital forensics, the detection of manipulated video content, particularly deepfakes, remains a prominent challenge. The paper "Video Face Manipulation Detection Through Ensemble of CNNs" addresses this problem by combining convolutional neural networks (CNNs) in an ensemble framework to judge whether the facial content in a video sequence is authentic.
Research Context and Objectives
The growing accessibility of facial manipulation technologies, such as DeepFakes and FaceSwap, has led to societal concerns regarding misinformation and digital privacy. This paper seeks to mitigate such threats by developing robust methods for detecting manipulated facial videos. The authors approach this problem by exploiting modern CNN architectures, with a particular focus on the EfficientNet family, to enhance detection capabilities.
Methodology Overview
The authors propose an ensemble of CNN models to improve facial manipulation detection. The primary architecture is EfficientNetB4, chosen for its efficient scaling and strong performance on image classification tasks. Two modifications of this baseline are examined:
- EfficientNetB4 with Attention Mechanism (EfficientNetB4Att): This variant introduces an attention mechanism that enables the network to focus on discriminative regions of the input video frames, thus potentially enhancing the detection of subtle forensics traces indicative of manipulation.
- Siamese Network Training (EfficientNetB4ST and EfficientNetB4AttST): The paper examines a siamese training strategy that uses a triplet margin loss to refine the feature representation, so that the learned embeddings separate real from manipulated facial sequences. This approach aims to improve the network's generalization by increasing the separation between classes in the embedding space.
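The siamese training objective described above can be sketched with a minimal triplet margin loss. The function name, margin value, and toy embeddings below are illustrative assumptions, not the paper's actual implementation, which operates on CNN feature embeddings during training:

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss on embedding vectors.

    Pulls the anchor toward the positive (same class, e.g. both real)
    and pushes it away from the negative (opposite class, e.g. fake).
    """
    d_pos = np.linalg.norm(anchor - positive)  # distance to same-class sample
    d_neg = np.linalg.norm(anchor - negative)  # distance to opposite-class sample
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings: the anchor is close to the positive and far from
# the negative, so the margin is already satisfied and the loss is zero.
anchor   = np.array([1.0, 0.0])
positive = np.array([1.1, 0.0])
negative = np.array([-2.0, 0.0])
print(triplet_margin_loss(anchor, positive, negative))  # prints 0.0
```

Minimizing this loss over many (anchor, positive, negative) triplets pushes real and fake samples into separated clusters in the embedding space, which is the effect the t-SNE visualizations in the paper illustrate.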
Experimental Evaluation
The models were evaluated on two significant datasets: FaceForensics++ (FF++) and the DeepFake Detection Challenge (DFDC) dataset. The FF++ dataset includes several synthesis methods at different compression quality levels, while the DFDC dataset offers an extensive collection of deepfake videos. Notable outcomes from these evaluations include:
- The ensemble approach comprising multiple models consistently outperformed single models in terms of Area Under the Curve (AUC) and LogLoss metrics.
- The incorporation of attention mechanisms was shown to highlight visually significant areas of the face, such as the eyes and mouth, which are critical in identifying manipulation artifacts.
- Models trained with siamese strategies demonstrated improved clustering of real versus fake video frames in feature space, as evidenced by t-SNE visualizations.
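As a rough illustration of the score-level fusion evaluated above, an ensemble prediction can be formed by averaging per-model "fake" probabilities and scoring the result with LogLoss. This is a minimal sketch with made-up scores, not the paper's evaluation code:

```python
import numpy as np

def log_loss(y_true, y_score, eps=1e-12):
    """Binary cross-entropy (LogLoss) between labels and predicted probabilities."""
    p = np.clip(y_score, eps, 1.0 - eps)  # clip to avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# Hypothetical per-frame "fake" probabilities from three single models.
labels   = np.array([0, 0, 1, 1])            # 0 = real, 1 = manipulated
scores_a = np.array([0.2, 0.4, 0.7, 0.9])
scores_b = np.array([0.3, 0.1, 0.8, 0.6])
scores_c = np.array([0.1, 0.3, 0.9, 0.8])

# Score-level fusion: average the per-model probabilities.
ensemble = np.mean([scores_a, scores_b, scores_c], axis=0)

print(log_loss(labels, scores_a))   # single model
print(log_loss(labels, ensemble))   # lower (better) on this toy data
```

Averaging tends to cancel the uncorrelated errors of individual models, which is the intuition behind the ensemble's consistently better AUC and LogLoss in the paper's experiments.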
Implications and Future Directions
The implications of this paper are twofold. Practically, the research provides a potential framework for enhancing real-time detection systems that could be leveraged by media platforms to flag problematic content. Theoretically, it opens avenues for exploring hybrid training strategies and architectural innovations, such as attention mechanisms, within neural networks for forensic purposes.
Looking forward, future work could integrate temporal dynamics to account for inconsistencies across video frames, using more expressive temporal modeling techniques. Moreover, adversarial robustness and the explainability of model decisions are critical areas that could further strengthen face manipulation detection systems.
In conclusion, the paper's strategic use of CNN ensembles within a carefully constructed architecture demonstrates a meaningful step towards counteracting malicious video manipulations, offering both conceptual insights and practical solutions to the ongoing digital forensics challenge.