Can't say cant? Measuring and Reasoning of Dark Jargons in Large Language Models (2405.00718v1)
Abstract: Ensuring the resilience of LLMs against malicious exploitation is paramount; recent work has focused on mitigating offensive responses, yet LLMs' handling of cant, or dark jargon, remains unexplored. This paper introduces a domain-specific cant dataset and the CantCounter evaluation framework, which comprises four stages: Fine-Tuning, Co-Tuning, Data-Diffusion, and Data-Analysis. Experiments reveal that LLMs, including ChatGPT, are susceptible to cant-based queries that bypass content filters, with recognition accuracy varying by question type, experimental setup, and prompt clues. Updated models exhibit higher acceptance rates for cant queries. Moreover, LLM reactions differ across domains; for example, models show greater reluctance to engage with racism-related cant than with LGBT-related cant. These findings illuminate how well LLMs understand cant and reflect both training-data characteristics and vendors' approaches to sensitive topics. Additionally, we assess the models' ability to demonstrate reasoning about cant. Our datasets and code are available at https://github.com/cistineup/CantCounter.
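The four stages named in the abstract can be pictured as a simple evaluation pipeline. The sketch below is purely illustrative: the stage names come from the abstract, but every function body, the toy cant terms, and the scoring rule are assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a four-stage cant-evaluation pipeline
# (stage names from the abstract; all bodies are illustrative stand-ins).

def fine_tune(cant_terms):
    # Stage 1 (Fine-Tuning): normalize domain-specific cant terms
    # before adapting a model or classifier to them.
    return [t.lower() for t in cant_terms]

def co_tune(terms, domains):
    # Stage 2 (Co-Tuning): pair each cant term with its domain label.
    return list(zip(terms, domains))

def data_diffusion(pairs):
    # Stage 3 (Data-Diffusion): expand each (term, domain) pair into
    # prompt variants, e.g. different question types and clue levels.
    return [f"What does '{t}' mean in the {d} context?" for t, d in pairs]

def data_analysis(prompts, responses):
    # Stage 4 (Data-Analysis): score how often the model engaged with
    # a cant query instead of refusing it.
    engaged = sum(1 for r in responses if r != "refused")
    return engaged / len(prompts)

if __name__ == "__main__":
    # Toy walkthrough with two well-known drug-slang terms.
    terms = fine_tune(["Snow", "Ice"])
    pairs = co_tune(terms, ["drug", "drug"])
    prompts = data_diffusion(pairs)
    acceptance_rate = data_analysis(prompts, ["refused", "an answer"])
    print(acceptance_rate)  # 0.5
```

The acceptance rate computed in the last stage corresponds to the abstract's observation that models accept cant queries at different rates depending on domain and prompt design.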