- The paper demonstrates the development of low-complexity networks (Meso-4 and MesoInception-4) that achieve high detection accuracy for Deepfake and Face2Face videos.
- It introduces novel architectures using Inception modules and dilated convolutions to capture multi-scale features, enhancing forgery detection in compressed videos.
- Experimental results showcase frame-level accuracies up to 91.7% and video-level accuracy up to 98.4%, while highlighting challenges under heavy video compression.
MesoNet: A Compact Facial Video Forgery Detection Network
The rapid proliferation of digital images and videos in recent years has fueled both the utility and the dangers associated with these assets. Techniques such as Deepfake and Face2Face facilitate the creation of hyper-realistic forged videos, which pose significant risks by disseminating misinformation and compromising privacy. Traditional image forensics methods fall short when applied to video forgeries, primarily because of the extensive compression involved in video data. This paper introduces a deep-learning approach to detecting face tampering in videos using two low-complexity convolutional neural networks: Meso-4 and MesoInception-4.
Overview of Deepfake and Face2Face
Deepfake
Deepfake technology replaces a person's face in a video with another person's face using dual auto-encoders with shared encoder weights. These auto-encoders are trained separately on datasets comprising facial images of the original and target individuals. The shared encoder captures generalized facial attributes such as illumination, pose, and expression, while each decoder learns one person's unique facial characteristics. Once trained, a face from one video can be passed through the shared encoder and decoded with the other person's decoder, producing a face that retains the original pose and expression but carries the target's identity. Despite its efficacy, Deepfake-generated faces often lack fine details and appear blurry due to the dimensionality reduction imposed by the auto-encoder.
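The shared-encoder, dual-decoder idea can be illustrated with a toy linear sketch. The matrices `E`, `D_a`, `D_b` and all dimensions below are illustrative assumptions, not taken from the paper or any Deepfake implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a "face" is a flat 64-dim vector, the latent code is 8-dim.
FACE_DIM, LATENT_DIM = 64, 8

# One shared encoder captures person-independent attributes (pose, lighting).
E = rng.normal(size=(LATENT_DIM, FACE_DIM))

# Two person-specific decoders reconstruct faces of person A and person B.
D_a = rng.normal(size=(FACE_DIM, LATENT_DIM))
D_b = rng.normal(size=(FACE_DIM, LATENT_DIM))

def encode(face):
    return E @ face  # shared latent representation

def swap_to_b(face_a):
    # Face swap: encode a frame of person A, then decode with person B's
    # decoder, yielding A's pose/expression rendered with B's identity.
    return D_b @ encode(face_a)

face_a = rng.normal(size=FACE_DIM)
fake_b = swap_to_b(face_a)
print(fake_b.shape)  # (64,)
```

In a real system the encoder and decoders are deep convolutional networks trained by reconstruction loss; the linear maps here only convey the weight-sharing structure.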
Face2Face
Face2Face reenacts the facial expressions of a source individual onto a target individual in real-time using RGB-camera data. It requires a pre-recorded sequence of the target person to build a facial model, which is then morphed to mimic the source's expressions during runtime. The technique involves overlaying the target's face with a blendshape model to achieve photorealism.
Proposed Method
The proposed method utilizes two novel architectures: Meso-4 and MesoInception-4, designed to focus on mesoscopic properties of images for forgery detection.
Meso-4
Meso-4 comprises four layers of convolution and pooling, followed by a dense network. The convolutional layers utilize ReLU activations and Batch Normalization to combat the vanishing gradient problem, while the dense layers use Dropout for improved generalization and robustness. This network has 27,977 trainable parameters.
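To see why the network stays so small, one can count parameters layer by layer. The filter counts and layer sizes below are a plausible reconstruction in the spirit of Meso-4 (four small convolutional blocks with Batch Normalization, then a tiny dense head), not a configuration confirmed line-by-line by the paper:

```python
def conv_params(k, c_in, c_out):
    """Weights + biases of a single k x k convolution layer."""
    return k * k * c_in * c_out + c_out

def dense_params(n_in, n_out):
    """Weights + biases of a fully connected layer."""
    return n_in * n_out + n_out

# Hypothetical Meso-4-like stack: four conv blocks (each followed by pooling),
# then a small dense classifier on the flattened feature map.
convs = [conv_params(3, 3, 8),    # block 1: RGB input -> 8 filters
         conv_params(5, 8, 8),    # block 2
         conv_params(5, 8, 16),   # block 3
         conv_params(5, 16, 16)]  # block 4

# Batch Normalization adds a trainable scale and shift per channel.
bn = sum(2 * c for c in (8, 8, 16, 16))

# Dense head: flattened features -> 16 hidden units -> 1 output score.
head = dense_params(1024, 16) + dense_params(16, 1)

total = sum(convs) + bn + head
print(total)  # 27977
```

With these assumed channel counts the tally lands on the paper's 27,977 trainable parameters, which illustrates how dominant the dense head is even in a "compact" network.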
MesoInception-4
An alternative approach, MesoInception-4, integrates inception modules introduced by Szegedy et al. These modules apply several convolutions with different kernel sizes in parallel and concatenate their outputs, capturing features at multiple scales. The proposed model replaces the module's 5×5 convolutions with dilated 3×3 convolutions, which cover the same receptive field with fewer weights while preserving the mesoscopic properties of the image. This architecture has a slightly higher parameter count at 28,615.
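The receptive-field argument behind the dilated substitution can be checked directly: a 3×3 kernel with dilation 2 covers a 5×5 window while using only 9 weights instead of 25. The following single-channel numpy illustration is a minimal sketch, not the paper's implementation:

```python
import numpy as np

def dilated_conv2d_valid(x, w, dilation):
    """Naive 'valid' 2-D convolution with a dilated kernel (single channel)."""
    kh, kw = w.shape
    eff_h = (kh - 1) * dilation + 1  # effective receptive-field height
    eff_w = (kw - 1) * dilation + 1
    out_h = x.shape[0] - eff_h + 1
    out_w = x.shape[1] - eff_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Strided slicing skips (dilation - 1) pixels between kernel taps.
            patch = x[i:i + eff_h:dilation, j:j + eff_w:dilation]
            out[i, j] = np.sum(patch * w)
    return out

x = np.arange(64, dtype=float).reshape(8, 8)
w = np.ones((3, 3))

y = dilated_conv2d_valid(x, w, dilation=2)
# A dilation-2 3x3 kernel spans 5x5, so the output size matches a 5x5 'valid'
# convolution on the same 8x8 input:
print(y.shape)  # (4, 4)
```

The output size (4, 4) is exactly what a 5×5 kernel would produce on an 8×8 input, confirming the receptive-field equivalence at roughly a third of the weight count.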
Experimental Evaluation
The experiments are conducted on datasets created from real and forged videos, specifically targeting Deepfake and Face2Face forgeries.
Deepfake Dataset Results
The models are tested on a dataset comprising 175 Deepfake videos and an equivalent number of genuine videos. On a frame-by-frame basis, Meso-4 achieves a classification accuracy of 89.1%, while MesoInception-4 achieves 91.7%. When predictions are aggregated over entire videos, detection accuracy reaches 98.4% for MesoInception-4.
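The gap between frame-level and video-level accuracy comes from aggregating many per-frame predictions. The section above does not spell out the aggregation rule, but averaging per-frame scores and thresholding the mean is one simple sketch (the 0.5 threshold and the example scores are assumptions):

```python
def video_score(frame_scores):
    """Aggregate per-frame forgery probabilities into one video-level score."""
    return sum(frame_scores) / len(frame_scores)

def classify_video(frame_scores, threshold=0.5):
    # A few misclassified frames are outvoted by the majority, which is why
    # video-level accuracy can exceed frame-level accuracy.
    return "forged" if video_score(frame_scores) >= threshold else "real"

# 10 frames of a forged video: two frames fool the detector, the mean does not.
scores = [0.9, 0.8, 0.85, 0.3, 0.95, 0.7, 0.9, 0.2, 0.88, 0.92]
print(classify_video(scores))  # forged
```

Even with two frames scored as "real", the mean score (0.74) stays well above the threshold, so the video as a whole is still flagged.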
Face2Face Dataset Results
Utilizing the FaceForensics dataset, which includes multiple compression levels, both models demonstrate strong performance on lossless and lightly compressed videos (≥ 92%). However, detection accuracy drops significantly under heavy compression (Meso-4: 83.2%, MesoInception-4: 81.3%). Video-level aggregation again boosts detection rates, to approximately 95%.
Implications and Future Directions
The promising results from Meso-4 and MesoInception-4 underscore the significance of deep learning in video forensics. However, video compression remains a challenging aspect, affecting model efficacy. Future research may delve into optimizing networks that maintain high accuracy under varying compression levels. Additionally, leveraging advanced visualization techniques can further unravel deep neural network decision-making, enhancing both performance and interpretability.
This paper sets a meaningful precedent for the detection of face forgeries in videos, with practical implications for combating malicious uses of synthetic media. The proposed architectures, Meso-4 and MesoInception-4, offer a balanced trade-off between computational efficiency and detection accuracy, making them viable options for real-world applications. Future developments will likely refine these methodologies, extending their applicability across a broader spectrum of synthetic media.