Hacc-Man: An Arcade Game for Jailbreaking LLMs (2405.15902v1)
Abstract: The recent leaps in complexity and fluency of LLMs mean that, for the first time in human history, people can interact with computers using natural language alone. This creates monumental possibilities of automation and accessibility of computing, but also raises severe security and safety threats: When everyone can interact with LLMs, everyone can potentially break into the systems running LLMs. All it takes is creative use of language. This paper presents Hacc-Man, a game which challenges its players to "jailbreak" an LLM: subvert the LLM to output something that it is not intended to. Jailbreaking is at the intersection between creative problem solving and LLM security. The purpose of the game is threefold: 1. To heighten awareness of the risks of deploying fragile LLMs in everyday systems, 2. To heighten people's self-efficacy in interacting with LLMs, and 3. To discover the creative problem solving strategies, people deploy in this novel context.
- Albert Bandura. 1990. Perceived self-efficacy in the exercise of control over AIDS infection. Evaluation and program planning 13, 1 (1990), 9–17.
- Nicholas Caporusso et al. 2023. Generative Artificial Intelligence and the Emergence of Creative Displacement Anxiety. Research Directs in Psychology and Behavior 3, 1 (2023).
- Jonathan Cohen. 2023. Right on Track: NVIDIA Open-Source Software Helps Developers Add Guardrails to AI Chatbots — blogs.nvidia.com. https://blogs.nvidia.com/blog/ai-chatbot-guardrails-nemo/. [Accessed 10-04-2024].
- Assessing language model deployment with risk cards. arXiv preprint arXiv:2303.18190 (2023).
- Jeanette Falk and Nanna Inie. 2022. Materializing the abstract: Understanding AI by game jamming. Frontiers in Computer Science 4 (2022). https://doi.org/10.3389/fcomp.2022.959351
- Viktor Gecas. 1989. The social psychology of self-efficacy. Annual review of sociology 15, 1 (1989), 291–316.
- Riley Goodside. 2023. An unobtrusive image. https://twitter.com/goodside/status/1713041557589311863. [Accessed 10-04-2024].
- Alternate uses. (1978).
- Designing participatory ai: Creative professionals’ worries and expectations about generative ai. In Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems. 1–8.
- Summon a demon and bind it: A grounded theory of llm red teaming in the wild. arXiv preprint arXiv:2311.06237 (2023).
- Scott G. Isaksen. 2023. Developing Creative Potential: The Power of Process, People, and Place. Journal of Advanced Academics 34, 2 (2023), 111–144. https://doi.org/10.1177/1932202X231156389 arXiv:https://doi.org/10.1177/1932202X231156389
- Arjun Kharpal. [n. d.]. Chinese police arrest man who allegedly used ChatGPT to spread fake news in first case of its kind — cnbc.com. https://www.cnbc.com/2023/05/09/chinese-police-arrest-man-who-allegedly-used-chatgpt-to-spread-fake-news.html. [Accessed 11-04-2024].
- Jonah Lehrer. 2008. The eureka hunt. The New Yorker 28 (2008), 40–45.
- Against The Achilles’ Heel: A Survey on Red Teaming for Generative Models. arXiv preprint arXiv:2404.00629 (2024).
- Naming unrelated words predicts creativity. Proceedings of the National Academy of Sciences 118, 25 (2021), e2022340118.
- Creativity in the age of generative AI. Nat Hum Behav 7 (2023), 1836–1838. https://doi.org/10.1038/s41562-023-01751-1
- A StrongREJECT for Empty Jailbreaks. arXiv preprint arXiv:2402.10260 (2024).
- Joe Vest and James Tubberville. 2020. Red Team Development and Operations–A practical Guide. Independently Published (2020).
- Simon Willison. 2023. Prompt injection: What’s the worst that can happen? — simonwillison.net. https://simonwillison.net/2023/Apr/14/worst-that-can-happen/. [Accessed 10-04-2024].
- Simon Willison. 2024. Prompt injection and jailbreaking are not the same thing — simonwillison.net. https://simonwillison.net/2024/Mar/5/prompt-injection-jailbreaking/. [Accessed 10-04-2024].
- A New Era in LLM Security: Exploring Security Concerns in Real-World LLM-based Systems. arXiv preprint arXiv:2402.18649 (2024).
- Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models. arXiv preprint arXiv:2403.17336 (2024).
- Why Johnny can’t prompt: how non-AI experts try (and fail) to design LLM prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–21.
- Matheus Valentim (2 papers)
- Jeanette Falk (6 papers)
- Nanna Inie (9 papers)