Asymmetric Bias in Text-to-Image Generation with Adversarial Attacks (2312.14440v3)

Published 22 Dec 2023 in cs.LG and cs.CR

Abstract: The widespread use of Text-to-Image (T2I) models in content generation requires careful examination of their safety, including their robustness to adversarial attacks. Despite extensive research on adversarial attacks, the reasons for their effectiveness remain underexplored. This paper presents an empirical study on adversarial attacks against T2I models, focusing on analyzing factors associated with attack success rates (ASR). We introduce a new attack objective, entity swapping using adversarial suffixes, together with two gradient-based attack algorithms. Human and automatic evaluations reveal the asymmetric nature of ASRs for entity swapping: for example, it is easier to replace "human" with "robot" in the prompt "a human dancing in the rain" with an adversarial suffix, but the reverse replacement is significantly harder. We further propose probing metrics that provide indicative signals linking the model's beliefs to the adversarial ASR. We identify conditions that result in a success probability of 60% for adversarial attacks and others where this likelihood drops below 5%.
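The core attack primitive described in the abstract, optimizing an adversarial suffix so that a prompt's text embedding drifts toward a target entity ("human" toward "robot"), can be illustrated with a short HotFlip-style greedy gradient sketch. Everything below (the toy encoder, random token ids standing in for real prompts, and the hyperparameters) is an illustrative assumption, not the paper's actual algorithm or configuration; a real attack would target the diffusion model's text encoder and judge success from the generated images.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM, N_SUFFIX, N_STEPS = 1000, 64, 5, 200

class ToyTextEncoder(nn.Module):
    """Stand-in for a T2I text encoder (e.g. a CLIP text tower):
    token embeddings, mean pooling, and a linear projection."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.proj = nn.Linear(DIM, DIM)

    def forward(self, one_hot):              # one_hot: (seq_len, VOCAB)
        x = one_hot @ self.emb.weight        # differentiable token lookup
        return self.proj(x.mean(dim=0))      # (DIM,) pooled prompt embedding

enc = ToyTextEncoder()
src_ids = torch.randint(VOCAB, (8,))         # stands in for "a human dancing in the rain"
tgt_ids = torch.randint(VOCAB, (8,))         # stands in for "a robot dancing in the rain"
suffix = torch.randint(VOCAB, (N_SUFFIX,))   # adversarial suffix, optimized below

with torch.no_grad():
    tgt_emb = enc(F.one_hot(tgt_ids, VOCAB).float())

for _ in range(N_STEPS):
    one_hot = F.one_hot(torch.cat([src_ids, suffix]), VOCAB).float()
    one_hot.requires_grad_(True)
    # Pull the (source prompt + suffix) embedding toward the target prompt.
    loss = 1.0 - F.cosine_similarity(enc(one_hot), tgt_emb, dim=0)
    loss.backward()
    # Greedy coordinate step: at one random suffix position, pick the
    # replacement token whose first-order effect most decreases the loss.
    pos = len(src_ids) + int(torch.randint(N_SUFFIX, (1,)))
    suffix[pos - len(src_ids)] = one_hot.grad[pos].argmin()

print("loss:", float(loss), "suffix token ids:", suffix.tolist())
```

The asymmetry studied in the paper would show up here as the loss (and downstream swap success) dropping much faster in one swap direction than in the other, under otherwise identical optimization budgets.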

Authors (4)
  1. Haz Sameen Shahgir (9 papers)
  2. Xianghao Kong (12 papers)
  3. Greg Ver Steeg (95 papers)
  4. Yue Dong (61 papers)
Citations (4)
