
MesoNet: a Compact Facial Video Forgery Detection Network (1809.00888v1)

Published 4 Sep 2018 in cs.CV and eess.IV

Abstract: This paper presents a method to automatically and efficiently detect face tampering in videos, and particularly focuses on two recent techniques used to generate hyper-realistic forged videos: Deepfake and Face2Face. Traditional image forensics techniques are usually not well suited to videos due to the compression that strongly degrades the data. Thus, this paper follows a deep learning approach and presents two networks, both with a low number of layers to focus on the mesoscopic properties of images. We evaluate those fast networks on both an existing dataset and a dataset we have constituted from online videos. The tests demonstrate a very successful detection rate with more than 98% for Deepfake and 95% for Face2Face.

Citations (1,112)

Summary

  • The paper demonstrates the development of low-complexity networks (Meso-4 and MesoInception-4) that achieve high detection accuracy for Deepfake and Face2Face videos.
  • It introduces novel architectures using Inception modules and dilated convolutions to capture multi-scale features, enhancing forgery detection in compressed videos.
  • Experimental results showcase frame-level accuracies up to 91.7% and video-level accuracy up to 98.4%, while highlighting challenges under heavy video compression.

MesoNet: A Compact Facial Video Forgery Detection Network

The rapid proliferation of digital images and videos in recent years has fueled both the utility and dangers associated with these assets. Techniques such as Deepfake and Face2Face facilitate the creation of hyper-realistic forged videos, which pose significant risks by disseminating misinformation and compromising privacy. Traditional image forensics methods fall short in addressing the intricacies of video forgeries, primarily due to the extensive compression involved in video data. This paper introduces an approach focused on leveraging deep learning to detect face tampering in videos using two low-complexity convolutional neural networks: Meso-4 and MesoInception-4.

Overview of Deepfake and Face2Face

Deepfake

Deepfake technology replaces a person's face in a video with another person's face using dual auto-encoders with shared encoder weights. These auto-encoders are trained separately on datasets comprising facial images from the original and target individuals. The encoder captures generalized facial attributes like illumination, pose, and expression, while decoders learn unique facial characteristics. Once trained, the encoder can process a face from one video and the decoder generates a corresponding face with the characteristics of another person. Despite its efficacy, Deepfake-generated faces often lack fine details and appear blurry due to the constraints of auto-encoder dimensionality reduction.
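The shared-encoder/separate-decoder arrangement described above can be sketched in a toy form. The snippet below uses plain dense layers rather than the convolutional auto-encoders of real Deepfake pipelines, and all dimensions and weight initializations are illustrative assumptions, not the paper's configuration; it only demonstrates the swap mechanism of encoding with the shared encoder and decoding with the other person's decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared encoder weights: both auto-encoders compress faces the same way,
# capturing generalized attributes (illumination, pose, expression).
dim_in, dim_latent = 64, 8
W_enc = rng.normal(size=(dim_in, dim_latent)) * 0.1

# Separate decoders: each learns one person's unique facial characteristics.
W_dec_A = rng.normal(size=(dim_latent, dim_in)) * 0.1
W_dec_B = rng.normal(size=(dim_latent, dim_in)) * 0.1

def encode(x):
    # Non-linear projection into the shared latent space.
    return np.tanh(x @ W_enc)

def decode(z, W_dec):
    # Person-specific reconstruction from the shared latent code.
    return z @ W_dec

# Face swap: encode a frame of person A, decode with person B's decoder,
# producing person B's face with A's pose and expression.
frame_A = rng.normal(size=dim_in)
swapped = decode(encode(frame_A), W_dec_B)
```

The bottleneck (`dim_latent` much smaller than `dim_in`) is also why Deepfake outputs lose fine detail: the latent code cannot carry high-frequency texture.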

Face2Face

Face2Face reenacts the facial expressions of a source individual onto a target individual in real-time using RGB-camera data. It requires a pre-recorded sequence of the target person to build a facial model, which is then morphed to mimic the source's expressions during runtime. The technique involves overlaying the target's face with a blendshape model to achieve photorealism.

Proposed Method

The proposed method utilizes two novel architectures: Meso-4 and MesoInception-4, designed to focus on mesoscopic properties of images for forgery detection.

Meso-4

Meso-4 comprises four layers of convolution and pooling, followed by a dense network. The convolutional layers utilize ReLU activations and Batch Normalization to combat the vanishing gradient problem, while the dense layers use Dropout for improved generalization and robustness. This network has 27,977 trainable parameters.
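A quick way to see why such a network stays small is to trace the feature-map size through the four conv+pool stages. The stage configuration below (kernel sizes 3, 5, 5, 5 with 2×2 pooling, then a final 4×4 pool) is an assumption about the layout, not a verbatim reproduction of the paper's table; the point is that 'same'-padded convolutions leave the spatial size unchanged while pooling shrinks it.

```python
def conv_out(size, kernel, padding, stride=1):
    """Spatial size after a convolution (standard output-size formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def trace_meso4(size=256):
    """Trace the spatial size through a Meso-4-style stack of four
    conv + max-pool stages, returning the side length entering the
    dense head.  Stage parameters here are illustrative assumptions."""
    stages = [(3, 2), (5, 2), (5, 2), (5, 4)]  # (conv kernel, pool factor)
    for kernel, pool in stages:
        size = conv_out(size, kernel, padding=kernel // 2)  # 'same' conv
        size //= pool                                       # max pooling
    return size

print(trace_meso4())  # 8 -> an 8x8 map is flattened for the dense layers
```

With 256×256 inputs this yields an 8×8 map before flattening, which keeps the dense layers, and hence the total parameter count, small.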

MesoInception-4

An alternative approach, MesoInception-4, integrates Inception modules introduced by Szegedy et al. These modules stack multiple convolutions with different kernel sizes to better capture multi-scale features. The proposed model replaces 5×5 convolutions with dilated 3×3 convolutions, which preserves important mesoscopic properties. This architecture has a slightly higher parameter count at 28,615.
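The motivation for the dilated-convolution substitution can be checked with the standard effective-receptive-field formula, k + (k − 1)(d − 1) for a k×k kernel with dilation d:

```python
def effective_kernel(k, dilation):
    """Effective receptive-field side of a k x k convolution with the
    given dilation: k + (k - 1) * (dilation - 1)."""
    return k + (k - 1) * (dilation - 1)

# A 3x3 convolution dilated by 2 covers the same 5x5 window as a dense
# 5x5 convolution, but with only 9 weights per channel instead of 25.
print(effective_kernel(3, 2))  # 5
print(effective_kernel(5, 1))  # 5
```

So the swap keeps the 5×5 field of view at multiple scales while reducing per-filter weights, which is consistent with the modest overall parameter count reported for MesoInception-4.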

Experimental Evaluation

The experiments are conducted on datasets created from real and forged videos, specifically targeting Deepfake and Face2Face forgeries.

Deepfake Dataset Results

The models are tested on a dataset comprising 175 Deepfake videos and an equal number of genuine videos. On a frame-by-frame basis, Meso-4 achieves a classification accuracy of 89.1%, while MesoInception-4 achieves 91.7%. When predictions are aggregated over entire videos, the detection accuracy reaches 98.4% for MesoInception-4.
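The gap between frame-level and video-level accuracy comes from pooling many noisy per-frame predictions into a single decision. One plausible aggregation rule is averaging frame scores and thresholding; the paper reports video-level figures, but this specific mean-then-threshold rule is an assumption for illustration.

```python
def video_score(frame_probs, threshold=0.5):
    """Aggregate per-frame forgery probabilities into one video-level
    decision by averaging and thresholding (one plausible rule; the
    exact aggregation is an assumption, not taken from the paper)."""
    mean = sum(frame_probs) / len(frame_probs)
    return mean >= threshold

# A few misclassified frames are outvoted by the rest of the video.
print(video_score([0.9, 0.8, 0.4, 0.95, 0.7]))  # True  (flagged as forged)
print(video_score([0.1, 0.2, 0.6, 0.15]))       # False (flagged as genuine)
```

Averaging acts as a denoiser: individual frames may be blurred or poorly lit, but a consistent signal across the video dominates the mean.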

Face2Face Dataset Results

Utilizing the FaceForensics dataset, which includes different compression levels, both models demonstrate strong performance on lossless and lightly compressed videos (accuracy ≥ 92%). However, detection accuracy drops significantly under heavy compression (Meso-4: 83.2%, MesoInception-4: 81.3%). Video-level aggregation boosts detection rates to approximately 95%.

Implications and Future Directions

The promising results from Meso-4 and MesoInception-4 underscore the significance of deep learning in video forensics. However, video compression remains a challenging aspect, affecting model efficacy. Future research may delve into optimizing networks that maintain high accuracy under varying compression levels. Additionally, leveraging advanced visualization techniques can further unravel deep neural network decision-making, enhancing both performance and interpretability.

This paper sets a meaningful precedent in the detection of face forgeries in videos, with practical implications for combating malicious uses of synthetic media. The proposed architectures, Meso-4 and MesoInception-4, offer a balanced trade-off between computational efficiency and detection accuracy, making them viable options for real-world applications. Future developments will likely refine these methodologies, extending their applicability across a broader spectrum of synthetic media.
