- The paper introduces a detection method that exploits differences between full-face and central-face head pose estimations to identify Deep Fakes.
- It leverages 68 facial landmarks and affine transformations to compute 3D head poses, revealing inconsistencies in synthesized facial regions.
- Experiments on the UADFV dataset and a subset of the DARPA MediFor GAN Image/Video Challenge achieved AUROC scores of up to 0.974, demonstrating the method's effectiveness for media forensics.
Exposing Deep Fakes Using Inconsistent Head Poses
The paper "Exposing Deep Fakes Using Inconsistent Head Poses" by Xin Yang, Yuezun Li, and Siwei Lyu presents a method to detect AI-generated fake face images and videos, known as Deep Fakes, by leveraging inconsistencies in 3D head poses. The method rests on the observation that splicing a synthesized face region into an original image leaves detectable artifacts: 3D head poses estimated from different parts of the spliced face no longer agree.
Methodology
The proposed detection system is based on deviations in facial landmark locations introduced when faces are synthesized, which in turn distort head pose estimates. Neural network synthesis in Deep Fakes typically matches the synthesized face to the original person's facial expression but does not guarantee consistent landmark locations, producing inherent mismatches. The approach therefore compares the 3D head pose estimated from landmarks on the entire face against the pose estimated from landmarks in the central face region alone.
Head poses are extracted using a set of 68 facial landmarks, aligning 2D landmarks to a 3D average face model via affine transformations. The core hypothesis is that for real faces, the head poses calculated from the whole face and central face region should align closely, while for Deep Fake faces, these head poses will diverge due to landmark inconsistencies introduced during the synthesis process.
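The core comparison can be sketched in a few lines of numpy. This is a simplified illustration, not the paper's implementation: it fits a weak-perspective (orthographic) projection by least squares instead of the paper's full camera model, and the function names and the synthetic central-landmark index set are invented for the example.

```python
import numpy as np

def estimate_rotation(model_3d, landmarks_2d):
    """Fit an orthographic projection x ~ P @ X + t by least squares,
    then orthonormalize the rows of P to recover a head rotation.
    (A simplification of landmark-based pose estimation.)"""
    n = len(model_3d)
    A = np.hstack([model_3d, np.ones((n, 1))])       # [X | 1], shape (n, 4)
    sol, *_ = np.linalg.lstsq(A, landmarks_2d, rcond=None)
    P = sol[:3].T                                    # 2x3 scaled rotation rows
    r1 = P[0] / np.linalg.norm(P[0])                 # Gram-Schmidt removes scale
    r2 = P[1] - (P[1] @ r1) * r1
    r2 /= np.linalg.norm(r2)
    return np.vstack([r1, r2, np.cross(r1, r2)])     # full 3x3 rotation

def pose_inconsistency(model_3d, landmarks_2d, central_idx):
    """Angle (radians) between the rotations estimated from all
    landmarks and from the central-face subset alone."""
    R_all = estimate_rotation(model_3d, landmarks_2d)
    R_cen = estimate_rotation(model_3d[central_idx], landmarks_2d[central_idx])
    cos_t = np.clip((np.trace(R_all @ R_cen.T) - 1.0) / 2.0, -1.0, 1.0)
    return np.arccos(cos_t)
```

For a genuine face, both estimates come from the same underlying pose, so the angle is near zero; perturbing the central landmarks, as face splicing effectively does, inflates it.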
Experimental Validation
The study validates its hypothesis by conducting experiments using SVM classifiers trained on the differences between these head pose estimates. The experimental setup involves two datasets: UADFV, containing real and corresponding Deep Fake videos, and a subset of the DARPA MediFor GAN Image/Video Challenge. Training is conducted on frames from 35 real and 35 Deep Fake videos, while testing is performed on frames from the remaining videos and the images from the DARPA dataset.
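The classification step can be sketched as follows. This is a minimal linear SVM trained by subgradient descent on fabricated one-dimensional pose-difference features; the paper trains an SVM on the actual head-pose difference vectors, and every number below is invented for illustration.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Subgradient descent on the regularized hinge loss
    (lam/2)||w||^2 + mean(max(0, 1 - y*(X @ w + b))), with y in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1.0                     # samples inside the margin
        w -= lr * (lam * w - (X[viol] * y[viol, None]).sum(axis=0) / n)
        b -= lr * (-y[viol].sum() / n)
    return w, b

# Fabricated pose-difference features: small for real faces, larger for fakes.
rng = np.random.default_rng(1)
real = rng.normal(0.03, 0.02, size=(200, 1))
fake = rng.normal(0.30, 0.08, size=(200, 1))
X = np.vstack([real, fake])
y = np.concatenate([-np.ones(200), np.ones(200)])
w, b = train_linear_svm(X, y)
accuracy = np.mean(np.sign(X @ w + b) == y)
```

Because the two classes differ mainly in the magnitude of the pose difference, even this one-feature linear classifier separates them well on such synthetic data.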
Noteworthy numerical results include:
- AUROC of 0.89 on the UADFV dataset for image-level classification.
- AUROC of 0.843 on the DARPA dataset, whose images are more challenging because blurring makes landmark prediction less reliable.
- Averaging classification scores over the frames of each video raises the video-level AUROC to 0.974, demonstrating robustness to noise and variability in individual frame predictions.
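Why frame averaging helps can be shown with a small simulation (the score distributions below are invented, not taken from the paper): per-frame scores that only weakly separate real from fake become highly separable once averaged per video, since averaging shrinks the per-frame noise.

```python
import numpy as np

def auroc(pos, neg):
    """AUROC as the probability a positive score outranks a negative
    (the Mann-Whitney U formulation; ties count 0.5)."""
    pos, neg = np.asarray(pos)[:, None], np.asarray(neg)[None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

rng = np.random.default_rng(2)
frames = 100                                        # frames per video
real = rng.normal(0.40, 0.30, size=(50, frames))    # noisy per-frame scores
fake = rng.normal(0.60, 0.30, size=(50, frames))

frame_auc = auroc(fake.ravel(), real.ravel())             # classify frames alone
video_auc = auroc(fake.mean(axis=1), real.mean(axis=1))   # average per video
```

Averaging 100 frames cuts the score noise by a factor of 10, so a mediocre frame-level AUROC turns into near-perfect video-level separation, mirroring the jump to 0.974 reported in the paper.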
Implications and Future Work
The implications of this research are both practical and theoretical. Practically, it provides a reliable method for identifying Deep Fake content, which has significant applications in media forensics. Theoretically, it highlights intrinsic limitations in current Deep Fake generation methods related to facial landmark inconsistencies.
Future work could enhance the robustness of facial landmark detection and head pose estimation, potentially by incorporating more advanced neural architectures or by refining the SVM classifier with richer features. Cross-domain evaluations on other types of fake media could also test how well the detection framework generalizes.
Overall, this study marks a significant step towards effective Deep Fake detection, leveraging fundamental inconsistencies in synthesized face data that are challenging to correct in current generation pipelines.