Exploring Self-Supervised Vision Transformers for Deepfake Detection: A Comparative Analysis (2405.00355v2)
Abstract: This paper investigates the effectiveness of self-supervised pre-trained vision transformers (ViTs) compared to supervised pre-trained ViTs and convolutional neural networks (ConvNets) for detecting facial deepfake images and videos. It examines their potential for improved generalization and explainability, especially with limited training data. Despite the success of transformer architectures in various tasks, the deepfake detection community is hesitant to use large ViTs as feature extractors because they are perceived to need extensive data and to generalize poorly when trained on small datasets. This contrasts with ConvNets, which are already established as robust feature extractors. Additionally, training ViTs from scratch requires significant resources, limiting their use to large companies. Recent advancements in self-supervised learning (SSL) for ViTs, such as masked autoencoders and DINOs, show adaptability across diverse tasks and semantic segmentation capabilities. By applying SSL ViTs to deepfake detection with modest data and partial fine-tuning, we find that they adapt comparably well to the task and provide explainability via the attention mechanism. Moreover, partial fine-tuning of ViTs is a resource-efficient option.
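The abstract does not spell out what partial fine-tuning involves. One common reading, assumed here, is that most of the pre-trained backbone stays frozen while only the last few transformer blocks and a new classification head are updated. The sketch below illustrates that selection in plain Python. The parameter names (`blocks.<i>.`, `head.`), the helper `select_trainable`, and the choice of `last_k=2` are hypothetical and not taken from the paper.

```python
import re

# Hypothetical parameter names of the form "blocks.<index>.<rest>";
# real ViT implementations may name their parameters differently.
BLOCK_PATTERN = re.compile(r"^blocks\.(\d+)\.")


def select_trainable(param_names, num_blocks, last_k, head_prefix="head."):
    """Return the names left trainable under partial fine-tuning.

    Parameters in the last `last_k` transformer blocks and in the
    classification head stay trainable; everything else (patch embedding,
    positional embedding, earlier blocks, final norm) is treated as frozen.
    """
    first_trainable = max(num_blocks - last_k, 0)
    trainable = []
    for name in param_names:
        match = BLOCK_PATTERN.match(name)
        if match:
            # Keep only blocks at or beyond the cut-off index.
            if int(match.group(1)) >= first_trainable:
                trainable.append(name)
        elif name.startswith(head_prefix):
            trainable.append(name)
    return trainable


# A toy 12-block parameter listing (illustrative, not a real checkpoint).
names = ["patch_embed.proj.weight", "pos_embed"]
names += [f"blocks.{i}.attn.qkv.weight" for i in range(12)]
names += ["norm.weight", "head.weight", "head.bias"]

trainable = select_trainable(names, num_blocks=12, last_k=2)
print(trainable)
```

In a framework such as PyTorch, a list like this would typically be used to set `requires_grad = False` on every parameter that is not selected, so that only the chosen blocks and the head receive gradient updates. How many blocks to unfreeze is a tuning choice that trades adaptation capacity against compute and overfitting risk.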