Identifying and Mitigating the Security Risks of Generative AI (2308.14840v4)
Abstract: Every major technical invention resurfaces the dual-use dilemma -- the new technology has the potential to be used for good as well as for harm. Generative AI (GenAI) techniques, such as LLMs and diffusion models, have shown remarkable capabilities (e.g., in-context learning, code completion, and text-to-image generation and editing). However, GenAI can be used just as well by attackers to generate new attacks and to increase the velocity and efficacy of existing attacks. This paper reports the findings of a workshop held at Google (co-organized by Stanford University and the University of Wisconsin-Madison) on the dual-use dilemma posed by GenAI. This paper is not meant to be comprehensive; rather, it is an attempt to synthesize some of the interesting findings from the workshop. We discuss short-term and long-term goals for the community on this topic. We hope this paper provides both a launching point for discussion of this important topic and interesting problems that the research community can work to address.