CLIPping the Deception: Adapting Vision-Language Models for Universal Deepfake Detection (2402.12927v1)

Published 20 Feb 2024 in cs.CV

Abstract: The recent advancements in Generative Adversarial Networks (GANs) and the emergence of Diffusion models have significantly streamlined the production of highly realistic and widely accessible synthetic content. As a result, there is a pressing need for effective general-purpose detection mechanisms to mitigate the potential risks posed by deepfakes. In this paper, we explore the effectiveness of pre-trained vision-language models (VLMs) when paired with recent adaptation methods for universal deepfake detection. Following previous studies in this domain, we employ only a single dataset (ProGAN) to adapt CLIP for deepfake detection. However, in contrast to prior research, which relies solely on the visual part of CLIP while ignoring its textual component, our analysis reveals that retaining the text part is crucial. Consequently, the simple and lightweight Prompt Tuning-based adaptation strategy that we employ outperforms the previous SOTA approach by 5.01% mAP and 6.61% accuracy while utilizing less than one third of the training data (200k images compared to 720k). To assess the real-world applicability of our proposed models, we conduct a comprehensive evaluation across various scenarios. This involves rigorous testing on images sourced from 21 distinct datasets, including those generated by GAN-based, Diffusion-based, and commercial tools.
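The prompt-tuning adaptation the abstract refers to can be pictured as a CoOp-style setup: both CLIP encoders stay frozen, and only a small set of learnable context vectors, prepended to the token embeddings of the class names, is trained on the ProGAN data. The sketch below is a minimal illustration under those assumptions, using PyTorch and the OpenAI clip package; the ViT-L/14 backbone, the 16 context tokens, and the class names "real"/"fake" are illustrative choices, not details taken from the paper's released code.

```python
# Minimal CoOp-style prompt-tuning sketch for binary real/fake detection with CLIP.
# Assumptions (not from the paper's code): OpenAI `clip` package, ViT-L/14 backbone,
# 16 learnable context tokens, class names "real" and "fake".
import torch
import torch.nn as nn
import clip


class PromptTunedCLIP(nn.Module):
    def __init__(self, backbone="ViT-L/14", classnames=("real", "fake"),
                 n_ctx=16, device="cpu"):
        super().__init__()
        self.model, self.preprocess = clip.load(backbone, device=device)
        self.model.float()                      # keep everything in fp32 for simplicity
        for p in self.model.parameters():       # freeze the entire CLIP backbone
            p.requires_grad_(False)

        dim = self.model.ln_final.weight.shape[0]
        # Learnable context vectors shared by both classes (the only trainable weights).
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

        # Tokenize "X X ... X <classname>" so sequence length and EOS position are correct;
        # the "X" placeholders are later replaced by the learnable context vectors.
        prompts = [" ".join(["X"] * n_ctx) + " " + name for name in classnames]
        tokenized = clip.tokenize(prompts).to(device)            # [n_cls, 77]
        with torch.no_grad():
            embedded = self.model.token_embedding(tokenized)     # [n_cls, 77, dim]
        self.register_buffer("tokenized", tokenized)
        self.register_buffer("prefix", embedded[:, :1, :])           # SOS token
        self.register_buffer("suffix", embedded[:, 1 + n_ctx:, :])   # class name, EOS, padding

    def encode_text(self):
        n_cls = self.prefix.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        x = torch.cat([self.prefix, ctx, self.suffix], dim=1)    # [n_cls, 77, dim]
        x = x + self.model.positional_embedding
        x = self.model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
        x = self.model.ln_final(x)
        # Take features at the EOS position (highest token id), then project.
        eos = self.tokenized.argmax(dim=-1)
        return x[torch.arange(n_cls), eos] @ self.model.text_projection

    def forward(self, images):
        img = self.model.encode_image(images)
        txt = self.encode_text()
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        return self.model.logit_scale.exp() * img @ txt.t()      # [batch, 2] logits


# Training loop sketch: only `model.ctx` receives gradients.
# model = PromptTunedCLIP()
# optimizer = torch.optim.AdamW([model.ctx], lr=2e-3)
# logits = model(batch_of_preprocessed_images)
# loss = nn.functional.cross_entropy(logits, labels)   # labels: 0 = real, 1 = fake
```

Because only the context vectors are optimized, the trainable parameter count is tiny compared to fine-tuning the visual encoder, which is consistent with the abstract's point that a lightweight adaptation retaining CLIP's text side can still generalize across generator families.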

Authors (2)
  1. Sohail Ahmed Khan (10 papers)
  2. Duc-Tien Dang-Nguyen (11 papers)
Citations (9)