Multi-task Learning for Detecting and Segmenting Manipulated Facial Images and Videos
In the field of digital media forensics, detecting and segmenting manipulated facial images and videos has become critically important given the proliferation of synthetic media such as deepfakes. The paper "Multi-task Learning For Detecting and Segmenting Manipulated Facial Images and Videos" presents a convolutional neural network (CNN) built on a multi-task learning framework that tackles this challenge by integrating the detection and segmentation tasks in a single model.
The authors propose a CNN with a Y-shaped autoencoder designed to improve performance by performing classification and segmentation of manipulated media simultaneously. The network follows a semi-supervised learning approach: the encoder extracts a shared feature encoding, and the Y-shaped decoder splits into three branches for classification, segmentation, and reconstruction. This shared learning across tasks improves the generalizability and effectiveness of the network, since information learned for one task (for example, locating manipulated regions) benefits the others.
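The data flow described above can be sketched as follows. This is a minimal illustrative sketch in NumPy, not the paper's actual layer configuration: the encoder, latent size (128), image resolution (64×64), and linear stand-ins for the convolutional branches are all placeholder assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x):
    # Stand-in for the CNN encoder: flatten and project to a latent code.
    flat = x.reshape(x.shape[0], -1)
    W = rng.standard_normal((flat.shape[1], 128)) * 0.01
    return np.tanh(flat @ W)

def classification_branch(z):
    # First decoder arm: probability that each input is manipulated.
    w = rng.standard_normal(128) * 0.01
    return 1.0 / (1.0 + np.exp(-(z @ w)))            # shape (batch,)

def segmentation_branch(z, hw=(64, 64)):
    # Second decoder arm: per-pixel manipulation mask.
    W = rng.standard_normal((128, hw[0] * hw[1])) * 0.01
    logits = z @ W
    return (1.0 / (1.0 + np.exp(-logits))).reshape(-1, *hw)

def reconstruction_branch(z, shape=(64, 64, 3)):
    # Third arm: reconstruct the input, regularizing the shared encoding.
    W = rng.standard_normal((128, int(np.prod(shape)))) * 0.01
    return (z @ W).reshape(-1, *shape)

x = rng.standard_normal((2, 64, 64, 3))              # a toy batch of face crops
z = encoder(x)
p = classification_branch(z)
mask = segmentation_branch(z)
recon = reconstruction_branch(z)
print(p.shape, mask.shape, recon.shape)              # (2,) (2, 64, 64) (2, 64, 64, 3)
```

The key structural point is that all three branches consume the same latent code `z`, which is what lets gradients from each task shape the shared encoder.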
For the empirical evaluation, the network was assessed on the FaceForensics and FaceForensics++ datasets, widely used benchmarks of manipulated media covering manipulations such as facial reenactment and face swapping. The network performed strongly on seen attacks and, with additional fine-tuning, generalized to previously unseen ones. Notably, the segmentation outputs provide valuable insight into which regions were manipulated, achieving high accuracy, particularly on FaceSwap manipulations.
The architectural comparisons also underscore the benefit of a deeper CNN over traditional shallower networks, which yielded superior classification accuracy and adaptability. The autoencoder's reconstruction branch was found to noticeably boost segmentation accuracy, highlighting the value of the loss weighting strategy devised by the authors.
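A loss of this kind is typically a weighted sum of the per-branch losses. The sketch below is a hedged illustration of that idea, not the paper's exact formulation: the weight values `w`, the use of binary cross-entropy for both classification and segmentation, and mean squared error for reconstruction are all assumptions made for the example.

```python
import numpy as np

def bce(p, y, eps=1e-7):
    # Binary cross-entropy averaged over all elements.
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def multitask_loss(p_cls, y_cls, p_seg, y_seg, x_rec, x, w=(1.0, 1.0, 0.1)):
    # Weighted sum of the three branch losses. The weights w are
    # placeholders, not the paper's actual values or schedule.
    l_cls = bce(p_cls, y_cls)                    # manipulation probability
    l_seg = bce(p_seg, y_seg)                    # per-pixel mask
    l_rec = float(np.mean((x_rec - x) ** 2))     # reconstruction error
    return w[0] * l_cls + w[1] * l_seg + w[2] * l_rec

# Toy usage: confident, correct predictions and perfect reconstruction
# should yield a small total loss.
x = np.zeros((2, 4, 4, 3))
loss = multitask_loss(
    p_cls=np.array([0.95, 0.05]), y_cls=np.array([1.0, 0.0]),
    p_seg=np.full((2, 4, 4), 0.1), y_seg=np.zeros((2, 4, 4)),
    x_rec=x, x=x,
)
print(loss)
```

Down-weighting the reconstruction term (here `w[2] = 0.1`) is a common way to let it act as a regularizer on the shared encoding without dominating the two supervised objectives.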
From a theoretical standpoint, this approach exemplifies ongoing advances in applying multi-task learning and deep neural networks to complex multimedia forensic challenges, in line with the broader trend of using shared feature learning to improve performance across related tasks. Practically, the proposed method shows potential as a reliable real-world tool for distinguishing authentic from manipulated media content.
As future work, the paper identifies several avenues for improvement: exploring residual images to further enhance the autoencoder's capabilities, processing inputs without resizing to preserve high-resolution details, and expanding the evaluation to audiovisual manipulations. Advances in these areas could improve detection efficacy and broaden applicability across varied contexts of multimedia manipulation.
In summary, the paper presents a methodologically sound and empirically validated model for manipulated-media detection and segmentation, contributing valuable insights and advancements to the field of digital forensic science.