
FakeBench: Probing Explainable Fake Image Detection via Large Multimodal Models (2404.13306v2)

Published 20 Apr 2024 in cs.CV and cs.MM

Abstract: Distinguishing whether an image is AI-generated is a crucial human capability, usually involving a complex and dialectical forensic reasoning process. However, current fake image detection models and databases focus on binary classification without explanations understandable to the general public, which weakens the credibility of authenticity judgments and may conceal potential model biases. Meanwhile, large multimodal models (LMMs) have exhibited strong visual-language capabilities on various tasks, opening the door to explainable fake image detection. We therefore pioneer the study of LMMs for explainable fake image detection by presenting FakeBench, a multimodal database containing textual authenticity descriptions. For its construction, we first introduce a fine-grained taxonomy of generative visual forgery grounded in human perception, based on which we collect natural-language forgery descriptions with a human-in-the-loop strategy. FakeBench examines LMMs under four evaluation criteria: detection, reasoning, interpretation, and fine-grained forgery analysis, to obtain deeper insights into authenticity-related capabilities. Experiments on various LMMs reveal their respective strengths and weaknesses across these fake image detection tasks. This research marks a shift towards transparency in the fake image detection area and highlights the need for greater emphasis on forensic elements in visual-language research and AI risk control. FakeBench will be available at https://github.com/Yixuan423/FakeBench.
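The detection criterion described in the abstract (scoring an LMM's free-text authenticity verdicts against ground-truth labels) can be sketched as follows. This is a minimal illustration, not FakeBench's released evaluation code: the record fields (`label`, `model_answer`) and the keyword-matching heuristic are assumptions for the sketch.

```python
# Hypothetical sketch of the "detection" criterion: parse an LMM's free-text
# answer into a fake/real verdict and compute accuracy over a set of records.

def parse_verdict(answer: str):
    """Map a free-text answer to 'fake' or 'real' via simple keyword matching."""
    text = answer.lower()
    if any(k in text for k in ("fake", "ai-generated", "synthetic")):
        return "fake"
    if any(k in text for k in ("real", "authentic", "genuine")):
        return "real"
    return None  # unparseable answers count as incorrect

def detection_accuracy(records):
    """records: list of {'label': 'fake'|'real', 'model_answer': str}."""
    correct = sum(parse_verdict(r["model_answer"]) == r["label"] for r in records)
    return correct / len(records)

# Toy demo records (invented for illustration).
demo = [
    {"label": "fake", "model_answer": "This looks AI-generated: the hands are distorted."},
    {"label": "real", "model_answer": "The photo appears authentic."},
    {"label": "fake", "model_answer": "I am not sure."},
]
print(detection_accuracy(demo))  # 2 of 3 verdicts match the labels
```

The reasoning and interpretation criteria would additionally compare the model's textual forgery descriptions against the human-written references, e.g. with text-similarity metrics; that comparison is omitted here.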

