- The paper introduces LipForensics, a method that leverages high-level semantic lip motion anomalies to detect face forgeries.
- It employs a spatio-temporal network pretrained on lipreading to learn robust, generalizable features for identifying manipulated videos.
- Empirical evaluations on datasets like FaceShifter and DeeperForensics highlight its resilience to common post-processing perturbations and novel attack types.
LipForensics: Analyzing Robust and Generalizable Face Forgery Detection
The proliferation of deep generative models has heightened the need for robust face forgery detection systems capable of identifying manipulated videos created using various methodologies. "Lips Don't Lie: A Generalisable and Robust Approach to Face Forgery Detection" presents LipForensics—a framework that leverages lip movement inconsistencies to enhance the generalization and robustness of face forgery detectors.
Overview of LipForensics
LipForensics addresses the deficiencies of current detection systems, which excel in constrained scenarios but falter under novel manipulations or routine post-processing. At its core, LipForensics targets high-level semantic irregularities in mouth movements rather than low-level manipulation traces. To achieve this, the framework uses a spatio-temporal network pretrained for visual speech recognition (lipreading), which equips it with rich representations intrinsically tied to natural mouth motion. The network is then finetuned on real and manipulated data with the pretrained feature extractor kept frozen, so that the detector learns from high-level mouth embeddings without overfitting to manipulation-specific artifacts.
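The two-stage pipeline described above can be sketched in PyTorch. This is a minimal illustration, not the paper's actual architecture: the layer sizes, module names, and the simple temporal head below are hypothetical stand-ins for the pretrained lipreading frontend and the multi-scale temporal network.

```python
import torch
import torch.nn as nn

class FrozenLipreadingFrontend(nn.Module):
    """Stand-in for the spatio-temporal frontend pretrained on lipreading.

    Its parameters are frozen, mirroring the paper's strategy of reusing
    lipreading representations rather than relearning them on forgery data.
    """
    def __init__(self, feat_dim=64):
        super().__init__()
        # 3D convolution over (grayscale channel, time, height, width)
        self.conv3d = nn.Conv3d(1, 16, kernel_size=(5, 7, 7),
                                stride=(1, 2, 2), padding=(2, 3, 3))
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep the time axis
        self.proj = nn.Linear(16, feat_dim)
        for p in self.parameters():  # frozen during forgery finetuning
            p.requires_grad = False

    def forward(self, x):                              # x: (B, 1, T, H, W)
        h = self.pool(torch.relu(self.conv3d(x)))      # (B, 16, T, 1, 1)
        h = h.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 16)
        return self.proj(h)                            # (B, T, feat_dim)

class TemporalClassifier(nn.Module):
    """Stand-in for the trainable temporal network on top of frozen features."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.tcn = nn.Conv1d(feat_dim, 32, kernel_size=3, padding=1)
        self.head = nn.Linear(32, 1)                   # real/fake logit

    def forward(self, feats):                          # feats: (B, T, feat_dim)
        h = torch.relu(self.tcn(feats.transpose(1, 2)))  # (B, 32, T)
        return self.head(h.mean(dim=2))                # temporal average pooling

frontend, classifier = FrozenLipreadingFrontend(), TemporalClassifier()
clip = torch.randn(2, 1, 25, 88, 88)   # 2 clips of 25 grayscale mouth crops
logit = classifier(frontend(clip))     # (2, 1) real/fake logits
```

Only the classifier's parameters receive gradients; the frontend acts as a fixed extractor of motion-aware mouth embeddings.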
Significance in Robustness and Generalization
The paper underscores how LipForensics surpasses existing approaches by demonstrating remarkable generalization to unseen manipulation types and robustness against diverse perturbations. In empirical evaluations across various datasets and novel forgery techniques, LipForensics consistently outperforms prior state-of-the-art methods, including Face X-ray and other generalization-targeted detectors. Its strengths are pronounced in handling scenarios where detectors conventionally fail, such as unseen manipulation types from datasets like FaceShifter and DeeperForensics.
Moreover, LipForensics demonstrates significant resilience to common perturbations, including compression, pixelation, and blurring, a capability crucial for real-world applications where videos frequently undergo post-processing. This robustness stems primarily from its focus on high-level motion dynamics rather than low-level pixel artifacts.
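A robustness evaluation of this kind degrades the input frames before scoring them. The sketch below shows two such perturbations (with hypothetical parameters, not the exact corruption settings used in the paper's benchmarks): pixelation via down/up-sampling and a simple box blur.

```python
import torch
import torch.nn.functional as F

def pixelate(frames, factor=4):
    """Down-sample then up-sample each frame, destroying fine pixel detail."""
    b, c, h, w = frames.shape
    small = F.interpolate(frames, scale_factor=1 / factor,
                          mode="bilinear", align_corners=False)
    return F.interpolate(small, size=(h, w), mode="nearest")

def box_blur(frames, k=5):
    """Blur with a k x k box filter, applied per channel."""
    c = frames.shape[1]
    kernel = torch.ones(c, 1, k, k) / (k * k)
    return F.conv2d(frames, kernel, padding=k // 2, groups=c)

frames = torch.rand(8, 1, 88, 88)      # a batch of mouth-crop frames
degraded = box_blur(pixelate(frames))  # low-level cues largely removed
```

Because both operations suppress high-frequency content while leaving the coarse motion of the mouth intact across frames, a detector that relies on motion semantics should score `degraded` similarly to `frames`, whereas one keyed to pixel-level artifacts will not.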
Methodological Implications
The paper provides valuable insights into the significance of pretraining tasks aligned with the detection objectives. By employing a lipreading pretraining paradigm, LipForensics can embed semantic knowledge inherent in natural speech, an approach shown to outperform other pretraining datasets like Kinetics or large-scale face recognition datasets. This emphasizes the utility of targeted pretraining in rendering a model generalizable across unforeseen attack vectors.
Rigorous ablation studies demonstrate that both spatio-temporal feature extraction and the temporal convolutional network are essential to the performance gains. Keeping the feature extractor frozen during finetuning mitigates the risk of overfitting, steering the model toward high-level semantic inconsistencies rather than superficial cues that video processing pipelines can destroy.
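The freeze-and-finetune setup described above looks as follows in PyTorch. The tiny `extractor` and `head` modules here are hypothetical placeholders; the point is the training mechanics: only the head's parameters are passed to the optimizer, so the pretrained representations stay untouched.

```python
import torch
import torch.nn as nn

extractor = nn.Sequential(nn.Linear(64, 64), nn.ReLU())  # stand-in frontend
head = nn.Linear(64, 1)                                  # stand-in classifier

for p in extractor.parameters():
    p.requires_grad = False  # freeze the pretrained representations
extractor.eval()             # also fix any normalization statistics

# Optimize only the trainable head; the extractor gets no gradient updates.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()

feats = torch.randn(4, 64)                     # pretend mouth embeddings
labels = torch.randint(0, 2, (4, 1)).float()   # real (0) vs. fake (1)

optimizer.zero_grad()
loss = criterion(head(extractor(feats)), labels)
loss.backward()              # gradients flow into the head only
optimizer.step()
```

After `backward()`, the head's weights carry gradients while the extractor's remain `None`, which is exactly the overfitting guard the ablations credit.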
Future Directions
The research opens avenues for further exploration of pretraining on domain-specific auxiliary tasks that directly inform the downstream learning objective. While LipForensics is tailored to detecting videos with altered mouth movements, future work could adapt analogous approaches to other facial regions or attributes, enabling detection grounded in comprehensive facial behavior analysis.
As forgery methods evolve, future developments could benefit from integrating complementary modalities, such as audio-visual coherence checks, contributing to even more robust multi-modal forgery detection frameworks. Additionally, enhancing scalability and inference efficiency for real-time applications remains paramount to counteract the rapid dissemination of forged media.
In conclusion, LipForensics delivers substantial advancements in face forgery detection by harnessing the intricacies of human mouth motion. As methods for face manipulations progress, adopting frameworks like LipForensics capable of generalizing and maintaining robustness against a backdrop of innovation in forgery technology will be increasingly vital in safeguarding digital content authenticity.