Position: Towards Implicit Prompt For Text-To-Image Models (2403.02118v4)

Published 4 Mar 2024 in cs.CY, cs.AI, and cs.CV

Abstract: Recent text-to-image (T2I) models have achieved great success, and many benchmarks have been proposed to evaluate their performance and safety. However, these benchmarks consider only explicit prompts, neglecting implicit prompts, which hint at a target without mentioning it explicitly. Such prompts may evade safety constraints and pose potential threats to the applications of these models. This position paper examines the current state of T2I models with respect to implicit prompts. We present a benchmark named ImplicitBench and investigate the performance and impacts of implicit prompts on popular T2I models. Specifically, we design and collect more than 2,000 implicit prompts covering three aspects: General Symbols, Celebrity Privacy, and Not-Safe-For-Work (NSFW) Issues, and evaluate six well-known T2I models under these prompts. Experimental results show that (1) T2I models can accurately render various target symbols indicated by implicit prompts; (2) implicit prompts introduce potential risks of privacy leakage for T2I models; and (3) the NSFW constraints of most evaluated T2I models can be bypassed with implicit prompts. We call for increased attention to the potential and risks of implicit prompts in the T2I community and for further investigation into their capabilities and impacts, advocating a balanced approach that harnesses their benefits while mitigating their risks.
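To make the evaluation concrete, below is a minimal sketch of the kind of probe the benchmark implies: generating images from paired explicit and implicit prompts with an open-source T2I pipeline and logging whether the built-in safety checker fires. The checkpoint, prompt pair, and output format here are illustrative assumptions based on the public diffusers API, not the authors' actual ImplicitBench harness or prompt set.

```python
# Minimal sketch (illustrative only): probe a T2I model with an explicit
# prompt and an implicit paraphrase of the same target, and record whether
# the pipeline's built-in safety checker flags the output. Requires the
# `diffusers` library and a CUDA GPU; the checkpoint and prompts below are
# assumptions, not items from ImplicitBench.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # any SD checkpoint with a safety checker
    torch_dtype=torch.float16,
).to("cuda")

# Each pair names the same target explicitly and then only by implication.
prompt_pairs = [
    ("a red octagonal stop sign",
     "the sign that makes every driver halt at an intersection"),
]

for explicit, implicit in prompt_pairs:
    for kind, prompt in (("explicit", explicit), ("implicit", implicit)):
        out = pipe(prompt, num_inference_steps=30)
        # `nsfw_content_detected` is a list of booleans when the safety
        # checker is enabled, and None when it has been disabled.
        flagged = bool(out.nsfw_content_detected[0]) if out.nsfw_content_detected else False
        out.images[0].save(f"{kind}_{abs(hash(prompt)) % 10**8}.png")
        print(f"[{kind:>8}] flagged={flagged}  prompt={prompt!r}")
```

A full evaluation along the paper's three aspects would layer symbol-accuracy and face-similarity metrics on top of this loop; the safety-checker flag above only speaks to the NSFW aspect.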

Authors (10)
  1. Yue Yang (146 papers)
  2. Hong Liu (394 papers)
  3. Wenqi Shao (89 papers)
  4. Runjian Chen (20 papers)
  5. Hailong Shang (1 paper)
  6. Yu Wang (939 papers)
  7. Yu Qiao (563 papers)
  8. Kaipeng Zhang (73 papers)
  9. Ping Luo (340 papers)
  10. Yuqi Lin (10 papers)
Citations (3)