Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems (2401.05778v1)
Abstract: LLMs have strong capabilities in solving diverse natural language processing tasks. However, the safety and security issues of LLM systems have become a major obstacle to their widespread application. Many studies have extensively investigated risks in LLM systems and developed corresponding mitigation strategies. Leading-edge enterprises such as OpenAI, Google, Meta, and Anthropic have also invested considerable effort in building responsible LLMs. Therefore, there is a growing need to organize the existing studies and establish comprehensive taxonomies for the community. In this paper, we delve into four essential modules of an LLM system: an input module for receiving prompts, an LLM trained on extensive corpora, a toolchain module for development and deployment, and an output module for exporting LLM-generated content. Based on this, we propose a comprehensive taxonomy that systematically analyzes the potential risks associated with each module of an LLM system and discusses the corresponding mitigation strategies. Furthermore, we review prevalent benchmarks, aiming to facilitate the risk assessment of LLM systems. We hope that this paper can help LLM practitioners adopt a systematic perspective when building responsible LLM systems.
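As a rough illustration of the four-module view described in the abstract, the sketch below wires an input module, an LLM, a toolchain module, and an output module into one pipeline. This is a minimal, hypothetical Python sketch; all class names, method names, and checks are placeholders introduced here for illustration, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class InputModule:
    """Receives prompts and applies input-side mitigations (e.g. prompt filtering)."""
    filters: List[Callable[[str], str]] = field(default_factory=list)

    def receive(self, prompt: str) -> str:
        for f in self.filters:
            prompt = f(prompt)
        return prompt


@dataclass
class LanguageModel:
    """Stand-in for an LLM trained on extensive corpora."""
    name: str = "toy-llm"

    def generate(self, prompt: str) -> str:
        # Placeholder generation; a real system would call a trained model here.
        return f"[{self.name} response to: {prompt}]"


@dataclass
class ToolchainModule:
    """Development/deployment toolchain (external tools, plugins, retrieval, etc.)."""
    tools: List[Callable[[str], str]] = field(default_factory=list)

    def augment(self, text: str) -> str:
        for tool in self.tools:
            text = tool(text)
        return text


@dataclass
class OutputModule:
    """Exports LLM-generated content after output-side checks (e.g. toxicity filtering)."""
    checks: List[Callable[[str], str]] = field(default_factory=list)

    def export(self, text: str) -> str:
        for check in self.checks:
            text = check(text)
        return text


@dataclass
class LLMSystem:
    """Composes the four modules; each stage is a natural place to attach mitigations."""
    input_module: InputModule
    llm: LanguageModel
    toolchain: ToolchainModule
    output_module: OutputModule

    def respond(self, prompt: str) -> str:
        sanitized = self.input_module.receive(prompt)
        draft = self.llm.generate(sanitized)
        augmented = self.toolchain.augment(draft)
        return self.output_module.export(augmented)


if __name__ == "__main__":
    system = LLMSystem(
        input_module=InputModule(filters=[str.strip]),
        llm=LanguageModel(),
        toolchain=ToolchainModule(),
        output_module=OutputModule(checks=[lambda s: s.replace("\n", " ")]),
    )
    print(system.respond("  What risks affect LLM systems?  "))
```

The design intent of the sketch is simply that risks and mitigations can be attached per module (input filtering, model alignment, toolchain hardening, output moderation), mirroring the taxonomy's module-by-module structure.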
- T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in NeurIPS, 2020.
- OpenAI, “GPT-4 technical report,” CoRR, vol. abs/2303.08774, 2023.
- H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “Llama: Open and efficient foundation language models,” CoRR, vol. abs/2302.13971, 2023.
- H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Canton-Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom, “Llama 2: Open foundation and fine-tuned chat models,” CoRR, vol. abs/2307.09288, 2023.
- A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, W. L. Tam, Z. Ma, Y. Xue, J. Zhai, W. Chen, Z. Liu, P. Zhang, Y. Dong, and J. Tang, “GLM-130B: an open bilingual pre-trained model,” in ICLR, 2023.
- Y. Wang, H. Le, A. Gotmare, N. D. Q. Bui, J. Li, and S. C. H. Hoi, “Codet5+: Open code large language models for code understanding and generation,” in EMNLP, 2023, pp. 1069–1088.
- S. Ye, H. Hwang, S. Yang, H. Yun, Y. Kim, and M. Seo, “In-context instruction learning,” CoRR, vol. abs/2302.14691, 2023.
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in NeurIPS, 2022.
- S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” CoRR, vol. abs/2305.10601, 2023.
- M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, L. Gianinazzi, J. Gajda, T. Lehmann, M. Podstawski, H. Niewiadomski, P. Nyczyk, and T. Hoefler, “Graph of thoughts: Solving elaborate problems with large language models,” CoRR, vol. abs/2308.09687, 2023.
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” in NeurIPS, 2022.
- OpenAI, “Introducing chatgpt,” https://openai.com/blog/chatgpt, 2022.
- ——, “March 20 chatgpt outage: Here’s what happened,” https://openai.com/blog/march-20-chatgpt-outage, 2023.
- X. Shen, Z. Chen, M. Backes, Y. Shen, and Y. Zhang, “”do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models,” CoRR, vol. abs/2308.03825, 2023.
- Y. Wang, Y. Pan, M. Yan, Z. Su, and T. H. Luan, “A survey on chatgpt: Ai-generated contents, challenges, and solutions,” IEEE Open J. Comput. Soc., vol. 4, pp. 280–302, 2023.
- B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, C. Xu, Z. Xiong, R. Dutta, R. Schaeffer, S. T. Truong, S. Arora, M. Mazeika, D. Hendrycks, Z. Lin, Y. Cheng, S. Koyejo, D. Song, and B. Li, “Decodingtrust: A comprehensive assessment of trustworthiness in GPT models,” CoRR, vol. abs/2306.11698, 2023.
- Y. Liu, Y. Yao, J. Ton, X. Zhang, R. Guo, H. Cheng, Y. Klochkov, M. F. Taufiq, and H. Li, “Trustworthy llms: a survey and guideline for evaluating large language models’ alignment,” CoRR, vol. abs/2308.05374, 2023.
- M. Gupta, C. Akiri, K. Aryal, E. Parker, and L. Praharaj, “From chatgpt to threatgpt: Impact of generative AI in cybersecurity and privacy,” IEEE Access, vol. 11, pp. 80 218–80 245, 2023.
- X. Huang, W. Ruan, W. Huang, G. Jin, Y. Dong, C. Wu, S. Bensalem, R. Mu, Y. Qi, X. Zhao, K. Cai, Y. Zhang, S. Wu, P. Xu, D. Wu, A. Freitas, and M. A. Mustafa, “A survey of safety and trustworthiness of large language models through the lens of verification and validation,” CoRR, vol. abs/2305.11391, 2023.
- OpenAI, “Developing safe & responsible ai,” https://openai.com/safety, 2022.
- Google, “Introducing gemini: our largest and most capable ai model,” https://blog.google/technology/ai/google-gemini-ai/#introducing-gemini, 2023.
- Meta, “Llama 2 - responsible user guide,” https://github.com/facebookresearch/llama/blob/main/Responsible-Use-Guide.pdf, 2023.
- Anthropic, “Ai research and products that put safety at the frontier,” https://www.anthropic.com/, 2023.
- W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J. Nie, and J. Wen, “A survey of large language models,” CoRR, vol. abs/2303.18223, 2023.
- M. F. Medress, F. S. Cooper, J. W. Forgie, C. C. Green, D. H. Klatt, M. H. O’Malley, E. P. Neuburg, A. Newell, R. Reddy, H. B. Ritea, J. E. Shoup-Hummel, D. E. Walker, and W. A. Woods, “Speech understanding systems,” Artif. Intell., vol. 9, no. 3, pp. 307–316, 1977.
- A. Fan, M. Lewis, and Y. N. Dauphin, “Hierarchical neural story generation,” in ACL, 2018, pp. 889–898.
- A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, “The curious case of neural text degeneration,” in ICLR, 2020.
- A. Peng, M. Wu, J. Allard, L. Kilpatrick, and S. Heidel, “Gpt-3.5 turbo fine-tuning and api updates,” https://openai.com/blog/gpt-3-5-turbo-fine-tuning-and-api-updates, 2023.
- OpenAI, “Model index for researchers,” https://platform.openai.com/docs/model-index-for-researchers, 2023.
- J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” CoRR, vol. abs/2001.08361, 2020.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS, 2017, pp. 5998–6008.
- J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, B. Yin, and X. Hu, “Harnessing the power of llms in practice: A survey on chatgpt and beyond,” CoRR, vol. abs/2304.13712, 2023.
- R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” CoRR, vol. abs/2305.18290, 2023.
- F. Song, B. Yu, M. Li, H. Yu, F. Huang, Y. Li, and H. Wang, “Preference ranking optimization for human alignment,” CoRR, vol. abs/2306.17492, 2023.
- Z. Yuan, H. Yuan, C. Tan, W. Wang, S. Huang, and F. Huang, “RRHF: rank responses to align language models with human feedback without tears,” CoRR, vol. abs/2304.05302, 2023.
- Y. Zhao, M. Khalman, R. Joshi, S. Narayan, M. Saleh, and P. J. Liu, “Calibrating sequence likelihood improves conditional language generation,” in ICLR, 2023.
- H. Liu, C. Sferrazza, and P. Abbeel, “Chain of hindsight aligns language models with feedback,” CoRR, vol. abs/2302.02676, 2023.
- R. Liu, C. Jia, G. Zhang, Z. Zhuang, T. X. Liu, and S. Vosoughi, “Second thoughts are best: Learning to re-align with human values from text edits,” in NeurIPS, 2022.
- R. Liu, R. Yang, C. Jia, G. Zhang, D. Zhou, A. M. Dai, D. Yang, and S. Vosoughi, “Training socially aligned language models in simulated human society,” CoRR, vol. abs/2305.16960, 2023.
- R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez, and R. Stojnic, “Galactica: A large language model for science,” CoRR, vol. abs/2211.09085, 2022.
- S. Black, S. Biderman, E. Hallahan, Q. Anthony, L. Gao, L. Golding, H. He, C. Leahy, K. McDonell, J. Phang, M. Pieler, U. S. Prashanth, S. Purohit, L. Reynolds, J. Tow, B. Wang, and S. Weinbach, “Gpt-neox-20b: An open-source autoregressive language model,” CoRR, vol. abs/2204.06745, 2022.
- A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, and N. Fiedel, “Palm: Scaling language modeling with pathways,” J. Mach. Learn. Res., vol. 24, pp. 240:1–240:113, 2023.
- E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong, “Codegen: An open large language model for code with multi-turn program synthesis,” in ICLR, 2023.
- J. D. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in ICML, 2001, pp. 282–289.
- C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh, M. Lewis, L. Zettlemoyer, and O. Levy, “LIMA: less is more for alignment,” CoRR, vol. abs/2305.11206, 2023.
- Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. E. Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. B. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan, “Training a helpful and harmless assistant with reinforcement learning from human feedback,” CoRR, vol. abs/2204.05862, 2022.
- P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” in NeurIPS, 2017, pp. 4299–4307.
- J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, H. F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, E. Rutherford, T. Hennigan, J. Menick, A. Cassirer, R. Powell, G. van den Driessche, L. A. Hendricks, M. Rauh, P. Huang, A. Glaese, J. Welbl, S. Dathathri, S. Huang, J. Uesato, J. Mellor, I. Higgins, A. Creswell, N. McAleese, A. Wu, E. Elsen, S. M. Jayakumar, E. Buchatskaya, D. Budden, E. Sutherland, K. Simonyan, M. Paganini, L. Sifre, L. Martens, X. L. Li, A. Kuncoro, A. Nematzadeh, E. Gribovskaya, D. Donato, A. Lazaridou, A. Mensch, J. Lespiau, M. Tsimpoukelli, N. Grigorev, D. Fritz, T. Sottiaux, M. Pajarskas, T. Pohlen, Z. Gong, D. Toyama, C. de Masson d’Autume, Y. Li, T. Terzi, V. Mikulik, I. Babuschkin, A. Clark, D. de Las Casas, A. Guy, C. Jones, J. Bradbury, M. J. Johnson, B. A. Hechtman, L. Weidinger, I. Gabriel, W. Isaac, E. Lockhart, S. Osindero, L. Rimell, C. Dyer, O. Vinyals, K. Ayoub, J. Stanway, L. Bennett, D. Hassabis, K. Kavukcuoglu, and G. Irving, “Scaling language models: Methods, analysis & insights from training gopher,” CoRR, vol. abs/2112.11446, 2021.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017.
- H. Sun, Z. Zhang, J. Deng, J. Cheng, and M. Huang, “Safety assessment of chinese large language models,” CoRR, vol. abs/2304.10436, 2023.
- A. Albert, “Jailbreak chat,” https://www.jailbreakchat.com/, 2023.
- S. Willison, “Prompt injection attacks against gpt-3,” https://simonwillison.net/2022/Sep/12/prompt-injection/, 2023.
- P. E. Guide, “Adversarial prompting,” https://www.promptingguide.ai/risks/adversarial, 2023.
- L. Prompting, “Prompt hacking,” https://learnprompting.org/docs/prompt_hacking/leaking, 2023.
- Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, and C. McKinnon, “Constitutional AI: harmlessness from AI feedback,” CoRR, vol. abs/2212.08073, 2022.
- R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, and Z. C. et al., “Palm 2 technical report,” CoRR, vol. abs/2305.10403, 2023.
- F. Perez and I. Ribeiro, “Ignore previous prompt: Attack techniques for language models,” CoRR, vol. abs/2211.09527, 2022.
- K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection,” CoRR, vol. abs/2302.12173, 2023.
- Y. Liu, G. Deng, Y. Li, K. Wang, T. Zhang, Y. Liu, H. Wang, Y. Zheng, and Y. Liu, “Prompt injection attack against llm-integrated applications,” CoRR, vol. abs/2306.05499, 2023.
- R. Pedro, D. Castro, P. Carreira, and N. Santos, “From prompt injections to sql injection attacks: How protected is your llm-integrated web application?” CoRR, vol. abs/2308.01990, 2023.
- M. Piedrafita, “Bypass openai’s chatgpt alignment efforts with this one weird trick,” https://twitter.com/m1guelpf/status/1598203861294252033, 2022.
- D. Kang, X. Li, I. Stoica, C. Guestrin, M. Zaharia, and T. Hashimoto, “Exploiting programmatic behavior of llms: Dual-use through standard security attacks,” CoRR, vol. abs/2302.05733, 2023.
- Y. Yuan, W. Jiao, W. Wang, J. Huang, P. He, S. Shi, and Z. Tu, “GPT-4 is too smart to be safe: Stealthy chat with llms via cipher,” CoRR, vol. abs/2308.06463, 2023.
- H. Li, D. Guo, W. Fan, M. Xu, J. Huang, F. Meng, and Y. Song, “Multi-step jailbreaking privacy attacks on chatgpt,” in EMNLP, 2023, pp. 4138–4153.
- G. Deng, Y. Liu, Y. Li, K. Wang, Y. Zhang, Z. Li, H. Wang, T. Zhang, and Y. Liu, “Jailbreaker: Automated jailbreak across multiple large language model chatbots,” CoRR, vol. abs/2307.08715, 2023.
- N. Carlini, F. Tramèr, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. B. Brown, D. Song, Ú. Erlingsson, A. Oprea, and C. Raffel, “Extracting training data from large language models,” in USENIX Security, 2021, pp. 2633–2650.
- J. Huang, H. Shao, and K. C. Chang, “Are large pre-trained language models leaking your personal information?” in EMNLP, 2022, pp. 2038–2047.
- F. Mireshghallah, A. Uniyal, T. Wang, D. Evans, and T. Berg-Kirkpatrick, “An empirical analysis of memorization in fine-tuned autoregressive language models,” in EMNLP, 2022, pp. 1816–1826.
- N. Lukas, A. Salem, R. Sim, S. Tople, L. Wutschitz, and S. Z. Béguelin, “Analyzing leakage of personally identifiable information in language models,” in SP, 2023, pp. 346–363.
- A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models,” CoRR, vol. abs/2307.15043, 2023.
- M. Shanahan, K. McDonell, and L. Reynolds, “Role play with large language models,” Nat., vol. 623, no. 7987, pp. 493–498, 2023.
- Y. Liu, G. Deng, Z. Xu, Y. Li, Y. Zheng, Y. Zhang, L. Zhao, T. Zhang, and Y. Liu, “Jailbreaking chatgpt via prompt engineering: An empirical study,” CoRR, vol. abs/2305.13860, 2023.
- Y. Wolf, N. Wies, Y. Levine, and A. Shashua, “Fundamental limitations of alignment in large language models,” CoRR, vol. abs/2304.11082, 2023.
- A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How does LLM safety training fail?” CoRR, vol. abs/2307.02483, 2023.
- B. Barak, “Another jailbreak for gpt4: Talk to it in morse code,” https://twitter.com/boazbaraktcs/status/1637657623100096513, 2023.
- N. kat, “New jailbreak based on virtual functions smuggle,” https://old.reddit.com/r/ChatGPT/comments/10urbdj/new_jailbreak_based_on_virtual_functions_smuggle/, 2023.
- Y. Zhu, R. Kiros, R. S. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler, “Aligning books and movies: Towards story-like visual explanations by watching movies and reading books,” in ICCV, 2015, pp. 19–27.
- T. H. Trinh and Q. V. Le, “A simple method for commonsense reasoning,” CoRR, vol. abs/1806.02847, 2018.
- R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, A. Farhadi, F. Roesner, and Y. Choi, “Defending against neural fake news,” in NeurIPS, 2019, pp. 9051–9062.
- J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, and J. Blackburn, “The pushshift reddit dataset,” in ICWSM, 2020, pp. 830–839.
- L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, and N. N. et al., “The pile: An 800gb dataset of diverse text for language modeling,” CoRR, vol. abs/2101.00027, 2021.
- H. Laurençon, L. Saulnier, T. Wang, C. Akiki, A. V. del Moral, T. L. Scao, L. von Werra, C. Mou, E. G. Ponferrada, and H. N. et al., “The bigscience ROOTS corpus: A 1.6tb composite multilingual dataset,” in NeurIPS, 2022.
- S. Kim, S. Yun, H. Lee, M. Gubri, S. Yoon, and S. J. Oh, “Propile: Probing privacy leakage in large language models,” CoRR, vol. abs/2307.01881, 2023.
- M. Fan, C. Chen, C. Wang, and J. Huang, “On the trustworthiness landscape of state-of-the-art generative models: A comprehensive survey,” CoRR, vol. abs/2307.16680, 2023.
- H. Shao, J. Huang, S. Zheng, and K. C. Chang, “Quantifying association capabilities of large language models and its implications on privacy leakage,” CoRR, vol. abs/2305.12707, 2023.
- X. Wu, R. Duan, and J. Ni, “Unveiling security, privacy, and ethical concerns of chatgpt,” CoRR, vol. abs/2307.14192, 2023.
- N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramèr, and C. Zhang, “Quantifying memorization across neural language models,” in ICLR, 2023.
- F. Mireshghallah, A. Uniyal, T. Wang, D. Evans, and T. Berg-Kirkpatrick, “Memorization in NLP fine-tuning methods,” CoRR, vol. abs/2205.12506, 2022.
- M. Jagielski, O. Thakkar, F. Tramèr, D. Ippolito, K. Lee, N. Carlini, E. Wallace, S. Song, A. G. Thakurta, N. Papernot, and C. Zhang, “Measuring forgetting of memorized training examples,” in ICLR, 2023.
- S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith, “Realtoxicityprompts: Evaluating neural toxic degeneration in language models,” in EMNLP, 2020, pp. 3356–3369.
- N. Ousidhoum, X. Zhao, T. Fang, Y. Song, and D. Yeung, “Probing toxic content in large pre-trained language models,” in ACL, 2021, pp. 4262–4274.
- O. Shaikh, H. Zhang, W. Held, M. S. Bernstein, and D. Yang, “On second thought, let’s not think step by step! bias and toxicity in zero-shot reasoning,” in ACL, 2023, pp. 4454–4470.
- S. Bordia and S. R. Bowman, “Identifying and reducing gender bias in word-level language models,” in NAACL-HLT, 2019, pp. 7–15.
- C. Wald and L. Pfahler, “Exposing bias in online communities through large-scale language models,” CoRR, vol. abs/2306.02294, 2023.
- J. Welbl, A. Glaese, J. Uesato, S. Dathathri, J. Mellor, L. A. Hendricks, K. Anderson, P. Kohli, B. Coppin, and P. Huang, “Challenges in detoxifying language models,” in EMNLP, 2021, pp. 2447–2469.
- Y. Huang, Q. Zhang, P. S. Yu, and L. Sun, “Trustgpt: A benchmark for trustworthy and responsible large language models,” CoRR, vol. abs/2306.11507, 2023.
- Y. Wang and Y. Chang, “Toxicity detection with generative prompt-based inference,” CoRR, vol. abs/2205.12390, 2022.
- J. Li, T. Du, S. Ji, R. Zhang, Q. Lu, M. Yang, and T. Wang, “Textshield: Robust text classification based on multimodal embedding and neural machine translation,” in USENIX Security, 2020, pp. 1381–1398.
- A. Deshpande, V. Murahari, T. Rajpurohit, A. Kalyan, and K. Narasimhan, “Toxicity in chatgpt: Analyzing persona-assigned language models,” CoRR, vol. abs/2304.05335, 2023.
- E. M. Smith, M. Hall, M. Kambadur, E. Presani, and A. Williams, “”i’m sorry to hear that”: Finding new biases in language models with a holistic descriptor dataset,” in EMNLP, 2022, pp. 9180–9211.
- T. Hossain, S. Dev, and S. Singh, “MISGENDERED: limits of large language models in understanding pronouns,” in ACL, 2023, pp. 5352–5367.
- M. Nadeem, A. Bethke, and S. Reddy, “Stereoset: Measuring stereotypical bias in pretrained language models,” in ACL, 2021, pp. 5356–5371.
- W. Fish, “Perception, hallucination, and illusion.” OUP USA, 2009.
- Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, A. Madotto, and P. Fung, “Survey of hallucination in natural language generation,” ACM Comput. Surv., vol. 55, no. 12, pp. 248:1–248:38, 2023.
- Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen et al., “Siren’s song in the ai ocean: A survey on hallucination in large language models,” CoRR, vol. abs/2309.01219, 2023.
- L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin et al., “A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,” CoRR, vol. abs/2311.05232, 2023.
- P. Laban, W. Kryscinski, D. Agarwal, A. R. Fabbri, C. Xiong, S. Joty, and C. Wu, “Llms as factual reasoners: Insights from existing benchmarks and beyond,” CoRR, vol. abs/2305.14540, 2023.
- D. Tam, A. Mascarenhas, S. Zhang, S. Kwan, M. Bansal, and C. Raffel, “Evaluating the factual consistency of large language models through news summarization,” in Findings of ACL, 2023, pp. 5220–5255.
- J. Fan, D. Aumiller, and M. Gertz, “Evaluating factual consistency of texts with semantic role labeling,” in *SEM@ACL, 2023, pp. 89–100.
- S. Lin, J. Hilton, and O. Evans, “Truthfulqa: Measuring how models mimic human falsehoods,” in ACL, 2022, pp. 3214–3252.
- P. Hase, M. T. Diab, A. Celikyilmaz, X. Li, Z. Kozareva, V. Stoyanov, M. Bansal, and S. Iyer, “Methods for measuring, updating, and visualizing factual beliefs in language models,” in EACL, 2023, pp. 2706–2723.
- N. Lee, W. Ping, P. Xu, M. Patwary, P. Fung, M. Shoeybi, and B. Catanzaro, “Factuality enhanced language models for open-ended text generation,” in NeurIPS, 2022.
- K. Shuster, S. Poff, M. Chen, D. Kiela, and J. Weston, “Retrieval augmentation reduces hallucination in conversation,” in Findings of EMNLP, 2021, pp. 3784–3803.
- B. Peng, M. Galley, P. He, H. Cheng, Y. Xie, Y. Hu, Q. Huang, L. Liden, Z. Yu, W. Chen, and J. Gao, “Check your facts and try again: Improving large language models with external knowledge and automated feedback,” CoRR, vol. abs/2302.12813, 2023.
- X. Yue, B. Wang, Z. Chen, K. Zhang, Y. Su, and H. Sun, “Automatic evaluation of attribution by large language models,” in Findings of EMNLP, 2023, pp. 4615–4635.
- J. Xie, K. Zhang, J. Chen, R. Lou, and Y. Su, “Adaptive chameleon or stubborn sloth: Unraveling the behavior of large language models in knowledge clashes,” CoRR, vol. abs/2305.13300, 2023.
- G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei, and J. Launay, “The refinedweb dataset for falcon LLM: outperforming curated corpora with web data, and web data only,” CoRR, vol. abs/2306.01116, 2023.
- D. Li, A. S. Rawat, M. Zaheer, X. Wang, M. Lukasik, A. Veit, F. X. Yu, and S. Kumar, “Large language models with controllable working memory,” in Findings of ACL, 2023, pp. 1774–1793.
- A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, and H. Hajishirzi, “When not to trust language models: Investigating effectiveness of parametric and non-parametric memories,” in ACL, 2023, pp. 9802–9822.
- K. Sun, Y. E. Xu, H. Zha, Y. Liu, and X. L. Dong, “Head-to-tail: How knowledgeable are large language models (llm)? A.K.A. will llms replace knowledge graphs?” CoRR, vol. abs/2308.10168, 2023.
- S. Zheng, J. Huang, and K. C. Chang, “Why does chatgpt fall short in answering questions faithfully?” CoRR, vol. abs/2304.10513, 2023.
- C. Kang and J. Choi, “Impact of co-occurrence on factual knowledge of large language models,” CoRR, vol. abs/2310.08256, 2023.
- S. Li, X. Li, L. Shang, Z. Dong, C. Sun, B. Liu, Z. Ji, X. Jiang, and Q. Liu, “How pre-trained language models capture factual knowledge? a causal-inspired analysis,” in Findings of ACL, 2022, pp. 1720–1732.
- K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck, C. Callison-Burch, and N. Carlini, “Deduplicating training data makes language models better,” in ACL, 2022, pp. 8424–8445.
- N. Kandpal, H. Deng, A. Roberts, E. Wallace, and C. Raffel, “Large language models struggle to learn long-tail knowledge,” in ICML, 2023, p. 15696–15707.
- D. Hernandez, T. Brown, T. Conerly, N. DasSarma, D. Drain, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, T. Henighan, T. Hume, S. Johnston, B. Mann, C. Olah, C. Olsson, D. Amodei, N. Joseph, J. Kaplan, and S. McCandlish, “Scaling laws and interpretability of learning from repeated data,” CoRR, vol. abs/2205.10487, 2022.
- N. McKenna, T. Li, L. Cheng, M. J. Hosseini, M. Johnson, and M. Steedman, “Sources of hallucination by large language models on inference tasks,” in Findings of EMNLP, 2023, pp. 2758–2774.
- J. W. Wei, D. Huang, Y. Lu, D. Zhou, and Q. V. Le, “Simple synthetic data reduces sycophancy in large language models,” CoRR, vol. abs/2308.03958, 2023.
- M. Sharma, M. Tong, T. Korbak, D. Duvenaud, A. Askell, S. R. Bowman, N. Cheng, E. Durmus, Z. Hatfield-Dodds, S. R. Johnston, S. Kravec, T. Maxwell, S. McCandlish, K. Ndousse, O. Rausch, N. Schiefer, D. Yan, M. Zhang, and E. Perez, “Towards understanding sycophancy in language models,” CoRR, vol. abs/2310.13548, 2023.
- M. Zhang, O. Press, W. Merrill, A. Liu, and N. A. Smith, “How language model hallucinations can snowball,” CoRR, vol. abs/2305.13534, 2023.
- A. Azaria and T. M. Mitchell, “The internal state of an LLM knows when its lying,” CoRR, vol. abs/2304.13734, 2023.
- D. Halawi, J. Denain, and J. Steinhardt, “Overthinking the truth: Understanding how language models process false demonstrations,” CoRR, vol. abs/2307.09476, 2023.
- Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, L. Li, and Z. Sui, “A survey on in-context learning,” CoRR, vol. abs/2301.00234, 2023.
- C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, S. Johnston, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah, “In-context learning and induction heads,” CoRR, vol. abs/2209.11895, 2022.
- N. Dziri, A. Madotto, O. Zaïane, and A. J. Bose, “Neural path hunter: Reducing hallucination in dialogue systems via path grounding,” in EMNLP, 2021, pp. 2197–2214.
- Y. Chen, R. Guan, X. Gong, J. Dong, and M. Xue, “D-DAE: defense-penetrating model extraction attacks,” in SP, 2023, pp. 382–399.
- Y. Shen, X. He, Y. Han, and Y. Zhang, “Model stealing attacks against inductive graph neural networks,” in SP, 2022, pp. 1175–1192.
- J. Mattern, F. Mireshghallah, Z. Jin, B. Schölkopf, M. Sachan, and T. Berg-Kirkpatrick, “Membership inference attacks against language models via neighbourhood comparison,” in ACL, 2023, pp. 11 330–11 343.
- J. Zhou, Y. Chen, C. Shen, and Y. Zhang, “Property inference attacks against gans,” in NDSS, 2022.
- H. Yang, M. Ge, and K. X. andF Jingwei Li, “Using highly compressed gradients in federated learning for data reconstruction attacks,” IEEE Trans. Inf. Forensics Secur., vol. 18, pp. 818–830, 2023.
- M. Fredrikson, S. Jha, and T. Ristenpart, “Model inversion attacks that exploit confidence information and basic countermeasures,” in CCS, 2015, pp. 1322–1333.
- G. Xia, J. Chen, C. Yu, and J. Ma, “Poisoning attacks in federated learning: A survey,” IEEE Access, vol. 11, pp. 10 708–10 722, 2023.
- E. O. Soremekun, S. Udeshi, and S. Chattopadhyay, “Towards backdoor attacks and defense in robust machine learning models,” Comput. Secur., vol. 127, p. 103101, 2023.
- I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in ICLR, 2015.
- I. Shumailov, Y. Zhao, D. Bates, N. Papernot, R. D. Mullins, and R. Anderson, “Sponge examples: Energy-latency attacks on neural networks,” in SP, 2021, pp. 212–231.
- W. M. Si, M. Backes, and Y. Zhang, “Mondrian: Prompt abstraction attack against large language models for cheaper api pricing,” CoRR, vol. abs/2308.03558, 2023.
- J. Shi, Y. Liu, P. Zhou, and L. Sun, “Badgpt: Exploring security vulnerabilities of chatgpt via backdoor attacks to instructgpt,” CoRR, vol. abs/2304.12298, 2023.
- J. Li, Y. Yang, Z. Wu, V. G. V. Vydiswaran, and C. Xiao, “Chatgpt as an attack tool: Stealthy textual backdoor attack via blackbox generative model trigger,” CoRR, vol. abs/2304.14475, 2023.
- J. Jia, A. Salem, M. Backes, Y. Zhang, and N. Z. Gong, “Memguard: Defending against black-box membership inference attacks via adversarial examples,” in CCS, 2019, pp. 259–274.
- J. Wang, X. Hu, W. Hou, H. Chen, R. Zheng, Y. Wang, L. Yang, H. Huang, W. Ye, and X. G. et al., “On the robustness of chatgpt: An adversarial and out-of-distribution perspective,” CoRR, vol. abs/2302.12095, 2023.
- Z. Li, C. Wang, P. Ma, C. Liu, S. Wang, D. Wu, and C. Gao, “On the feasibility of specialized ability extracting for large language code models,” CoRR, vol. abs/2303.03012, 2023.
- S. Zhao, J. Wen, A. T. Luu, J. Zhao, and J. Fu, “Prompt as triggers for backdoor attack: Examining the vulnerability in language models,” in EMNLP, 2023, pp. 12 303–12 317.
- M. Hilton, N. Nelson, T. Tunnell, D. Marinov, and D. Dig, “Trade-offs in continuous integration: assurance, security, and flexibility,” in ESEC/FSE, 2017, pp. 197–207.
- I. Koishybayev, A. Nahapetyan, R. Zachariah, S. Muralee, B. Reaves, A. Kapravelos, and A. Machiry, “Characterizing the security of github CI workflows,” in USENIX, 2022, pp. 2747–2763.
- S. Lee, H. Han, S. K. Cha, and S. Son, “Montage: A neural network language model-guided javascript engine fuzzer,” in USENIX, 2020, pp. 2613–2630.
- C. Lao, Y. Le, K. Mahajan, Y. Chen, W. Wu, A. Akella, and M. M. Swift, “ATP: in-network aggregation for multi-tenant learning,” in NSDI, 2021, pp. 741–761.
- Q. Xiao, Y. Chen, C. Shen, Y. Chen, and K. Li, “Seeing is not believing: Camouflage attacks on image scaling algorithms,” in USENIX Security, 2019, pp. 443–460.
- H. T. Maia, C. Xiao, D. Li, E. Grinspun, and C. Zheng, “Can one hear the shape of a neural network?: Snooping the GPU via magnetic side channel,” in USENIX, 2022, pp. 4383–4400.
- Y. Tobah, A. Kwong, I. Kang, D. Genkin, and K. G. Shin, “Spechammer: Combining spectre and rowhammer for new speculative attacks,” in SP, 2022, pp. 681–698.
- X. Luo and R. K. C. Chang, “On a new class of pulsing denial-of-service attacks and the defense,” in NDSS, 2005.
- E. Quiring, D. Klein, D. Arp, M. Johns, and K. Rieck, “Adversarial preprocessing: Understanding and preventing image-scaling attacks in machine learning,” in USENIX Security, 2020, pp. 1363–1380.
- Z. Zhan, Z. Zhang, S. Liang, F. Yao, and X. D. Koutsoukos, “Graphics peeping unit: Exploiting EM side-channel information of gpus to eavesdrop on your neighbors,” in SP, 2022, pp. 1440–1457.
- H. Mai, J. Zhao, H. Zheng, Y. Zhao, Z. Liu, M. Gao, C. Wang, H. Cui, X. Feng, and C. Kozyrakis, “Honeycomb: Secure and efficient GPU executions via static validation,” in OSDI, 2023, pp. 155–172.
- Y. Deng, C. Wang, S. Yu, S. Liu, Z. Ning, K. Leach, J. Li, S. Yan, Z. He, J. Cao, and F. Zhang, “Strongbox: A GPU TEE on arm endpoints,” in CCS, 2022, pp. 769–783.
- S. Tan, B. Knott, Y. Tian, and D. J. Wu, “Cryptgpu: Fast privacy-preserving machine learning on the GPU,” in SP, 2021, pp. 1021–1038.
- A. S. Rakin, Z. He, and D. Fan, “Bit-flip attack: Crushing neural network with progressive bit search,” in ICCV, 2019, pp. 1211–1220.
- F. Yao, A. S. Rakin, and D. Fan, “Deephammer: Depleting the intelligence of deep neural networks through targeted chain of bit flips,” in USENIX, 2020, pp. 1463–1480.
- J. Wang, Z. Zhang, M. Wang, H. Qiu, T. Zhang, Q. Li, Z. Li, T. Wei, and C. Zhang, “Aegis: Mitigating targeted bit-flip attacks against deep neural networks,” in USENIX, 2023, pp. 2329–2346.
- Q. Liu, J. Yin, W. Wen, C. Yang, and S. Sha, “Neuropots: Realtime proactive defense against bit-flip attacks in neural networks,” in USENIX, 2023, pp. 6347–6364.
- Y. Peng, Y. Zhu, Y. Chen, Y. Bao, B. Yi, C. Lan, C. Wu, and C. Guo, “A generic communication scheduler for distributed DNN training acceleration,” in SOSP, T. Brecht and C. Williamson, Eds. ACM, 2019, pp. 16–29.
- Y. Jiang, Y. Zhu, C. Lan, B. Yi, Y. Cui, and C. Guo, “A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters,” in OSDI, 2020, pp. 463–479.
- A. Wei, Y. Deng, C. Yang, and L. Zhang, “Free lunch for testing: Fuzzing deep-learning libraries from open source,” in ICSE, 2022, pp. 995–1007.
- R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders et al., “Webgpt: Browser-assisted question-answering with human feedback,” CoRR, vol. abs/2112.09332, 2021.
- Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang, “Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface,” CoRR, vol. abs/2303.17580, 2023.
- Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou et al., “The rise and potential of large language model based agents: A survey. corr, abs/2309.07864, 2023. doi: 10.48550,” CoRR, vol. abs/2309.07864, 2023.
- L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin et al., “A survey on large language model based autonomous agents,” CoRR, vol. abs/2309.07864, 2023.
- T. Gao, H. Yen, J. Yu, and D. Chen, “Enabling large language models to generate text with citations,” in EMNLP, 2023, pp. 6465–6488.
- W. Shi, X. Han, M. Lewis, Y. Tsvetkov, L. Zettlemoyer, and S. W. Yih, “Trusting your evidence: Hallucinate less with context-aware decoding,” CoRR, vol. abs/2305.14739, 2023.
- S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” in ICLR, 2023.
- O. Ram, Y. Levine, I. Dalmedigos, D. Muhlgay, A. Shashua, K. Leyton-Brown, and Y. Shoham, “In-context retrieval-augmented language models,” CoRR, vol. abs/2302.00083, 2023.
- S. Zhang, L. Pan, J. Zhao, and W. Y. Wang, “Mitigating language model hallucination with interactive question-knowledge alignment,” CoRR, vol. abs/2305.13669, 2023.
- O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, and M. Lewis, “Measuring and narrowing the compositionality gap in language models,” in Findings of EMNLP, 2023, pp. 5687–5711.
- R. W. McGee, “Is chat gpt biased against conservatives? an empirical study,” An Empirical Study (February 15, 2023), 2023.
- T. Y. Zhuo, Y. Huang, C. Chen, and Z. Xing, “Exploring ai ethics of chatgpt: A diagnostic analysis,” CoRR, vol. abs/2301.12867, 2023.
- E. Ferrara, “Should chatgpt be biased? challenges and risks of bias in large language models,” CoRR, vol. abs/2304.03738, 2023.
- O. Oviedo-Trespalacios, A. E. Peden, T. Cole-Hunter, A. Costantini, M. Haghani, J. Rod, S. Kelly, H. Torkamaan, A. Tariq, J. D. A. Newton et al., “The risks of using chatgpt to obtain common safety-related information and advice,” Safety Science, vol. 167, p. 106244, 2023.
- N. Imran, A. Hashmi, and A. Imran, “Chat-gpt: Opportunities and challenges in child mental healthcare,” Pakistan Journal of Medical Sciences, vol. 39, no. 4.
- OPC, “Opc to investigate chatgpt jointly with provincial privacy authorities,” https://www.priv.gc.ca/en/opc-news/news-and-announcements/2023/an_230525-2/, 2023.
- M. Gurman, “Samsung bans staff’s ai use after spotting chatgpt data leak,” https://www.bloomberg.com/news/articles/2023-05-02/samsung-bans-chatgpt-and-other-generative-ai-use-by-staff-after-leak?srnd=technology-vp&in_source=embedded-checkout-banner/.
- S. Sabin, “Companies are struggling to keep corporate secrets out of chatgpt,” https://www.axios.com/2023/03/10/chatgpt-ai-cybersecurity-secrets/.
- Y. Elazar, N. Kassner, S. Ravfogel, A. Feder, A. Ravichander, M. Mosbach, Y. Belinkov, H. Schütze, and Y. Goldberg, “Measuring causal effects of data statistics on language model’sfactual’predictions,” CoRR, vol. abs/2207.14251, 2022.
- H. Alkaissi and S. I. McFarlane, “Artificial hallucinations in chatgpt: implications in scientific writing,” Cureus, vol. 15, no. 2, 2023.
- Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung et al., “A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity,” CoRR, vol. abs/2302.04023, 2023.
- J. Vincent, “Google’s ai chatbot bard makes factual error in first demo.” https://www.theverge.com/2023/2/8/23590864/google-ai-chatbot-bard-mistake-error-exoplanet-demo.
- I. Solaiman, M. Brundage, J. Clark, A. Askell, A. Herbert-Voss, J. Wu, A. Radford, G. Krueger, J. W. Kim, S. Kreps et al., “Release strategies and the social impacts of language models,” CoRR, vol. abs/1908.09203, 2019.
- J. Wu, W. Gan, Z. Chen, S. Wan, and H. Lin, “Ai-generated content (aigc): A survey,” CoRR, vol. abs/2304.06632, 2023.
- M. Elsen-Rooney, “Nyc education department blocks chatgpt on school devices, networks,” https://ny.chalkbeat.org/2023/1/3/23537987/nyc-schools-ban-chatgpt-writing-artificial-intelligence.
- U. Ede-Osifo, “College instructor put on blast for accusing students of using chatgpt on final assignments,” https://www.nbcnews.com/tech/chatgpt-texas-collegeinstructor-backlash-rcna8488.
- J. Lee, T. Le, J. Chen, and D. Lee, “Do language models plagiarize?” in Proceedings of the ACM Web Conference 2023, 2023, pp. 3637–3647.
- J. P. Wahle, T. Ruas, F. Kirstein, and B. Gipp, “How large language models are transforming machine-paraphrased plagiarism,” CoRR, vol. abs/2210.03568, 2022.
- P. Sharma and B. Dash, “Impact of big data analytics and chatgpt on cybersecurity,” in 2023 4th International Conference on Computing and Communication Systems (I3CS), 2023, pp. 1–6.
- P. Charan, H. Chunduri, P. M. Anand, and S. K. Shukla, “From text to mitre techniques: Exploring the malicious use of large language models for generating cyber attack payloads,” CoRR, vol. abs/2305.15336, 2023.
- O. Asare, M. Nagappan, and N. Asokan, “Is github’s copilot as bad as humans at introducing vulnerabilities in code?” CoRR, vol. abs/2204.04741, 2022.
- B. N, “Europol warns that hackers use chatgpt to conduct cyber attacks.” https://cybersecuritynews.com/hackers-use-chatgpt-to-conduct-cyber-attacks/.
- ——, “Chatgpt successfully built malware but failed to analyze the complex malware.” https://cybersecuritynews.com/chatgpt-failed-to-analyze-the-complex-malware/.
- Github, “Github copilot,” https://github.com/features/copilot, 2023.
- E. Crothers, N. Japkowicz, and H. L. Viktor, “Machine-generated text: A comprehensive survey of threat models and detection methods,” IEEE Access, 2023.
- R. Goodside, “Gpt-3 prompt injection defenses,” https://twitter.com/goodside/status/1578278974526222336?s=20&t=3UMZB7ntYhwAk3QLpKMAbw, 2022.
- L. Prompting, “Defensive measures,” https://learnprompting.org/docs/category/-defensive-measures, 2023.
- C. Mark, “Talking to machines: prompt engineering & injection,” https://artifact-research.com/artificial-intelligence/talking-to-machines-prompt-engineering-injection/, 2022.
- A. Volkov, “Discovery of sandwich defense,” https://twitter.com/altryne?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor, 2023.
- R. G. Stuart Armstrong, “Using gpt-eliezer against chatgpt jailbreaking,” https://www.alignmentforum.org/posts/pNcFYZnPdXyL2RfgA/using-gpt-eliezer-against-chatgpt-jailbreak, 2022.
- R. Goodside, “Quoted/escaped the input strings to defend against prompt attacks,” https://twitter.com/goodside/status/1569457230537441286?s=20, 2022.
- J. Selvi, “Exploring prompt injection attacks,” https://research.nccgroup.com/2022/12/05/exploring-prompt-injection-attacks/, 2022.
- J. Xu, D. Ju, M. Li, Y.-L. Boureau, J. Weston, and E. Dinan, “Recipes for safety in open-domain chatbots,” CoRR, vol. abs/2010.07079, 2020.
- S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith, “Realtoxicityprompts: Evaluating neural toxic degeneration in language models,” in Findings, 2020.
- J. Welbl, A. Glaese, J. Uesato, S. Dathathri, J. F. J. Mellor, L. A. Hendricks, K. Anderson, P. Kohli, B. Coppin, and P.-S. Huang, “Challenges in detoxifying language models,” CoRR, vol. abs/2109.07445, 2021.
- I. Solaiman and C. Dennison, “Process for adapting language models to society (palms) with values-targeted datasets,” CoRR, vol. abs/2106.10328, 2021.
- B. Wang, W. Ping, C. Xiao, P. Xu, M. Patwary, M. Shoeybi, B. Li, A. Anandkumar, and B. Catanzaro, “Exploring the limits of domain-adaptive training for detoxifying large-scale language models,” CoRR, vol. abs/2202.04173, 2022.
- OpenAI, “GPT-4 Technical Report,” CoRR, vol. abs/2303.08774, 2023.
- NVIDIA, “Nemo guardrails,” https://github.com/NVIDIA/NeMo-Guardrails, 2023.
- nostalgebraist, “interpreting gpt: the logit lens,” https://www.alignmentforum.org/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens, 2020.
- N. Belrose, Z. Furman, L. Smith, D. Halawi, I. Ostrovsky, L. McKinney, S. Biderman, and J. Steinhardt, “Eliciting latent predictions from transformers with the tuned lens,” CoRR, vol. abs/2303.08112, 2023.
- Z. Kan, L. Qiao, H. Yu, L. Peng, Y. Gao, and D. Li, “Protecting user privacy in remote conversational systems: A privacy-preserving framework based on text sanitization,” CoRR, vol. abs/2306.08223, 2023.
- Y. Li, Z. Tan, and Y. Liu, “Privacy-preserving prompt tuning for large language model services,” CoRR, vol. abs/2305.06212, 2023.
- P. Ruch, R. H. Baud, A. Rassinoux, P. Bouillon, and G. Robert, “Medical document anonymization with a semantic lexicon,” in AMIA, 2000.
- L. Deléger, K. Molnár, G. Savova, F. Xia, T. Lingren, Q. Li, K. Marsolo, A. G. Jegga, M. Kaiser, L. Stoutenborough, and I. Solti, “Large-scale evaluation of automated clinical note de-identification and its impact on information extraction,” J. Am. Medical Informatics Assoc., vol. 20, no. 1, pp. 84–94, 2013.
- F. Dernoncourt, J. Y. Lee, Ö. Uzuner, and P. Szolovits, “De-identification of patient notes with recurrent neural networks,” J. Am. Medical Informatics Assoc., vol. 24, no. 3, pp. 596–606, 2017.
- A. E. W. Johnson, L. Bulgarelli, and T. J. Pollard, “Deidentification of free-text medical records using pre-trained bidirectional transformers,” in CHIL, 2020, pp. 214–221.
- N. Kandpal, E. Wallace, and C. Raffel, “Deduplicating training data mitigates privacy risks in language models,” in ICML, ser. Proceedings of Machine Learning Research, vol. 162, 2022, pp. 10 697–10 707.
- C. Dwork, F. McSherry, K. Nissim, and A. D. Smith, “Calibrating noise to sensitivity in private data analysis,” J. Priv. Confidentiality, vol. 7, no. 3, pp. 17–51, 2016.
- C. Dwork, “A firm foundation for private data analysis,” Commun. ACM, vol. 54, no. 1, pp. 86–95, 2011.
- C. Dwork and A. Roth, “The algorithmic foundations of differential privacy,” Found. Trends Theor. Comput. Sci., vol. 9, no. 3-4, pp. 211–407, 2014.
- S. Hoory, A. Feder, A. Tendler, S. Erell, A. Peled-Cohen, I. Laish, H. Nakhost, U. Stemmer, A. Benjamini, A. Hassidim, and Y. Matias, “Learning and evaluating a differentially private pre-trained language model,” in EMNLP, 2021, pp. 1178–1189.
- J. Majmudar, C. Dupuy, C. Peris, S. Smaili, R. Gupta, and R. S. Zemel, “Differentially private decoding in large language models,” CoRR, vol. abs/2205.13621, 2022.
- D. Yu, S. Naik, A. Backurs, S. Gopi, H. A. Inan, G. Kamath, J. Kulkarni, Y. T. Lee, A. Manoel, and L. W. et al., “Differentially private fine-tuning of language models,” in ICLR, 2022.
- H. Ebadi, D. Sands, and G. Schneider, “Differential privacy: Now it’s getting personal,” in POPL, 2015, pp. 69–81.
- I. Kotsogiannis, S. Doudalis, S. Haney, A. Machanavajjhala, and S. Mehrotra, “One-sided differential privacy,” in ICDE, 2020, pp. 493–504.
- W. Shi, A. Cui, E. Li, R. Jia, and Z. Yu, “Selective differential privacy for language modeling,” in NAACL, 2022, pp. 2848–2859.
- W. Shi, R. Shea, S. Chen, C. Zhang, R. Jia, and Z. Yu, “Just fine-tune twice: Selective differential privacy for large language models,” in EMNLP, 2022, pp. 6327–6340.
- Z. Bu, Y. Wang, S. Zha, and G. Karypis, “Differentially private bias-term only fine-tuning of foundation models,” CoRR, vol. abs/2210.00036, 2022.
- A. Ginart, L. van der Maaten, J. Zou, and C. Guo, “Submix: Practical private prediction for large-scale language models,” CoRR, vol. abs/2201.00971, 2022.
- H. Duan, A. Dziedzic, N. Papernot, and F. Boenisch, “Flocks of stochastic parrots: Differentially private prompt learning for large language models,” CoRR, vol. abs/2305.15594, 2023.
- A. Panda, T. Wu, J. T. Wang, and P. Mittal, “Differentially private in-context learning,” CoRR, vol. abs/2305.01639, 2023.
- J. Pavlopoulos, P. Malakasiotis, and I. Androutsopoulos, “Deeper attention to abusive user content moderation,” in EMNLP, 2017, pp. 1125–1135.
- S. V. Georgakopoulos, S. K. Tasoulis, A. G. Vrahatis, and V. P. Plagianakos, “Convolutional neural networks for toxic comment classification,” in SETN, 2018, pp. 35:1–35:6.
- Z. Zhao, Z. Zhang, and F. Hopfgartner, “A comparative study of using pre-trained language models for toxic comment classification,” in WWW, 2021, pp. 500–507.
- C. AI, “Perspective api documentation,” https://github.com/conversationai/perspectiveapi, 2021.
- Azure, “Azure ai content safety,” https://azure.microsoft.com/en-us/products/ai-services/ai-content-safety, 2023.
- T. Bolukbasi, K. Chang, J. Y. Zou, V. Saligrama, and A. T. Kalai, “Man is to computer programmer as woman is to homemaker? debiasing word embeddings,” in NeurIPS, 2016, pp. 4349–4357.
- J. Zhao, T. Wang, M. Yatskar, R. Cotterell, V. Ordonez, and K. Chang, “Gender bias in contextualized word embeddings,” in NAACL-HLT, 2019, pp. 629–634.
- R. H. Maudslay, H. Gonen, R. Cotterell, and S. Teufel, “It’s all in the name: Mitigating gender bias with name-based counterfactual data substitution,” in EMNLP-IJCNLP, 2019, pp. 5266–5274.
- H. Thakur, A. Jain, P. Vaddamanu, P. P. Liang, and L. Morency, “Language models get a gender makeover: Mitigating gender bias with few-shot data interventions,” in ACL, 2023, pp. 340–351.
- C. N. dos Santos, I. Melnyk, and I. Padhi, “Fighting offensive language on social media with unsupervised text style transfer,” in ACL, 2018, pp. 189–194.
- L. Laugier, J. Pavlopoulos, J. Sorensen, and L. Dixon, “Civil rephrases of toxic texts with self-supervised transformers,” in EACL, 2021, pp. 1442–1461.
- V. Logacheva, D. Dementieva, S. Ustyantsev, D. Moskovskiy, D. Dale, I. Krotova, N. Semenov, and A. Panchenko, “Paradetox: Detoxification with parallel data,” in ACL, 2022, pp. 6804–6818.
- J. Zhao, Y. Zhou, Z. Li, W. Wang, and K. Chang, “Learning gender-neutral word embeddings,” in EMNLP, 2018, pp. 4847–4853.
- X. Peng, S. Li, S. Frazier, and M. O. Riedl, “Reducing non-normative text generation from language models,” in INLG, 2020, pp. 374–383.
- S. Dev, T. Li, J. M. Phillips, and V. Srikumar, “Oscar: Orthogonal subspace correction and rectification of biases in word embeddings,” in EMNLP, 2021, pp. 5034–5050.
- Z. Xie and T. Lukasiewicz, “An empirical analysis of parameter-efficient methods for debiasing pre-trained language models,” in ACL, 2023, pp. 15 730–15 745.
- X. He, S. Zannettou, Y. Shen, and Y. Zhang, “You only prompt once: On the capabilities of prompt learning on large language models to tackle toxic content,” CoRR, vol. abs/2308.05596, 2023.
- L. Ranaldi, E. S. Ruzzetti, D. Venditti, D. Onorati, and F. M. Zanzotto, “A trip towards fairness: Bias and de-biasing in large language models,” CoRR, vol. abs/2305.13862, 2023.
- A. Glaese, N. McAleese, M. Trebacz, J. Aslanides, V. Firoiu, T. Ewalds, M. Rauh, L. Weidinger, M. J. Chadwick, and P. T. et al., “Improving alignment of dialogue agents via targeted human judgements,” CoRR, vol. abs/2209.14375, 2022.
- H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, and S. B. et al., “Llama 2: Open foundation and fine-tuned chat models,” CoRR, vol. abs/2307.09288, 2023.
- A. Abbas, K. Tirumala, D. Simig, S. Ganguli, and A. S. Morcos, “Semdedup: Data-efficient learning at web-scale through semantic deduplication,” CoRR, vol. abs/2303.09540, 2023.
- Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen, L. Wang, A. T. Luu, W. Bi, F. Shi, and S. Shi, “Siren’s song in the AI ocean: A survey on hallucination in large language models,” CoRR, vol. abs/2309.01219, 2023.
- Z. Sun, S. Shen, S. Cao, H. Liu, C. Li, Y. Shen, C. Gan, L. Gui, Y. Wang, Y. Yang, K. Keutzer, and T. Darrell, “Aligning large multimodal models with factually augmented RLHF,” CoRR, vol. abs/2309.14525, 2023.
- T. Shen, R. Jin, Y. Huang, C. Liu, W. Dong, Z. Guo, X. Wu, Y. Liu, and D. Xiong, “Large language model alignment: A survey,” CoRR, vol. abs/2309.15025, 2023.
- K. Huang, H. P. Chan, and H. Ji, “Zero-shot faithful factual error correction,” in ACL, 2023, pp. 5660–5676.
- A. Chen, P. Pasupat, S. Singh, H. Lee, and K. Guu, “PURR: efficiently editing language model hallucinations by denoising language model corruptions,” CoRR, vol. abs/2305.14908, 2023.
- R. Zhao, X. Li, S. Joty, C. Qin, and L. Bing, “Verify-and-edit: A knowledge-enhanced chain-of-thought framework,” in ACL, 2023, pp. 5823–5840.
- W. Yu, Z. Zhang, Z. Liang, M. Jiang, and A. Sabharwal, “Improving language models via plug-and-play retrieval feedback,” CoRR, vol. abs/2305.14002, 2023.
- Z. Feng, X. Feng, D. Zhao, M. Yang, and B. Qin, “Retrieval-generation synergy augmented large language models,” CoRR, vol. abs/2310.05149, 2023.
- Z. Shao, Y. Gong, Y. Shen, M. Huang, N. Duan, and W. Chen, “Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy,” in Findings of EMNLP, 2023, pp. 9248–9274.
- S. Ahn, H. Choi, T. Pärnamaa, and Y. Bengio, “A neural knowledge language model,” CoRR, vol. abs/1608.00318, 2016.
- R. L. L. IV, N. F. Liu, M. E. Peters, M. Gardner, and S. Singh, “Barack’s wife hillary: Using knowledge graphs for fact-aware language modeling,” in ACL, 2019, pp. 5962–5971.
- Y. Wen, Z. Wang, and J. Sun, “Mindmap: Knowledge graph prompting sparks graph of thoughts in large language models,” CoRR, vol. abs/2308.09729, 2023.
- Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, N. Duan, and W. Chen, “CRITIC: large language models can self-correct with tool-interactive critiquing,” CoRR, vol. abs/2305.11738, 2023.
- N. Varshney, W. Yao, H. Zhang, J. Chen, and D. Yu, “A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation,” CoRR, vol. abs/2307.03987, 2023.
- Y. Chuang, Y. Xie, H. Luo, Y. Kim, J. R. Glass, and P. He, “Dola: Decoding by contrasting layers improves factuality in large language models,” CoRR, vol. abs/2309.03883, 2023.
- K. Li, O. Patel, F. B. Viégas, H. Pfister, and M. Wattenberg, “Inference-time intervention: Eliciting truthful answers from a language model,” CoRR, vol. abs/2306.03341, 2023.
- X. L. Li, A. Holtzman, D. Fried, P. Liang, J. Eisner, T. Hashimoto, L. Zettlemoyer, and M. Lewis, “Contrastive decoding: Open-ended text generation as optimization,” in ACL, 2023, pp. 12 286–12 312.
- S. Willison, “Reducing sycophancy and improving honesty via activation steering,” https://www.alignmentforum.org/posts/zt6hRsDE84HeBKh7E/reducing-sycophancy-and-improving-honesty-via-activation, 2023.
- Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, “Improving factuality and reasoning in language models through multiagent debate,” CoRR, vol. abs/2305.14325, 2023.
- R. Cohen, M. Hamri, M. Geva, and A. Globerson, “LM vs LM: detecting factual errors via cross examination,” in EMNLP, 2023, pp. 12 621–12 640.
- N. Akhtar and A. S. Mian, “Threat of adversarial attacks on deep learning in computer vision: A survey,” IEEE Access, vol. 6, pp. 14 410–14 430, 2018.
- M. Jagielski, N. Carlini, D. Berthelot, A. Kurakin, and N. Papernot, “High-fidelity extraction of neural network models,” CoRR, vol. abs/1909.01838, 2019.
- F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, “Stealing machine learning models via prediction apis,” in USENIX Security, 2016, pp. 601–618.
- T. Orekondy, B. Schiele, and M. Fritz, “Prediction poisoning: Towards defenses against DNN model stealing attacks,” in ICLR, 2020.
- I. M. Alabdulmohsin, X. Gao, and X. Zhang, “Adding robustness to support vector machines against adversarial reverse engineering,” in CIKM, 2014, pp. 231–240.
- V. Chandrasekaran, K. Chaudhuri, I. Giacomelli, S. Jha, and S. Yan, “Model extraction and active learning,” CoRR, vol. abs/1811.02054, 2018.
- T. Lee, B. Edwards, I. M. Molloy, and D. Su, “Defending against neural network model stealing attacks using deceptive perturbations,” in S&P Workshop, 2019, pp. 43–49.
- M. Juuti, S. Szyller, S. Marchal, and N. Asokan, “PRADA: protecting against DNN model stealing attacks,” in EuroS&P, 2019, pp. 512–527.
- H. Jia, C. A. Choquette-Choo, V. Chandrasekaran, and N. Papernot, “Entangled watermarks as a defense against model extraction,” in USENIX Security, 2021, pp. 1937–1954.
- A. B. Kahng, J. C. Lach, W. H. Mangione-Smith, S. Mantik, I. L. Markov, M. Potkonjak, P. Tucker, H. Wang, and G. Wolfe, “Watermarking techniques for intellectual property protection,” in DAC, 1998, pp. 776–781.
- M. Abadi, A. Chu, I. J. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep learning with differential privacy,” in SIGSAC, 2016, pp. 308–318.
- C. Dwork, “Differential privacy: A survey of results,” in TAMC, 2008, pp. 1–19.
- D. Chen, N. Yu, and M. Fritz, “Relaxloss: Defending membership inference attacks without losing utility,” in ICLR, 2022.
- C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” in ICML, 2017, pp. 1321–1330.
- G. Pereyra, G. Tucker, J. Chorowski, L. Kaiser, and G. E. Hinton, “Regularizing neural networks by penalizing confident output distributions,” in ICLR workshop, 2017.
- M. Nasr, R. Shokri, and A. Houmansadr, “Machine learning with membership privacy using adversarial regularization,” in CCS, 2018, pp. 634–646.
- J. Jia and N. Z. Gong, “Attriguard: A practical defense against attribute inference attacks via adversarial machine learning,” in USENIX Security, 2018, pp. 513–529.
- S. Awan, B. Luo, and F. Li, “CONTRA: defending against poisoning attacks in federated learning,” in ESORICS, 2021, pp. 455–475.
- F. Qi, M. Li, Y. Chen, Z. Zhang, Z. Liu, Y. Wang, and M. Sun, “Hidden killer: Invisible textual backdoor attacks with syntactic trigger,” in ACL/IJCNLP, 2021, pp. 443–453.
- W. Yang, Y. Lin, P. Li, J. Zhou, and X. Sun, “Rethinking stealthiness of backdoor attack against NLP models,” in ACL/IJCNLP, 2021, pp. 5543–5557.
- B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao, “Neural cleanse: Identifying and mitigating backdoor attacks in neural networks,” in S&P, 2019, pp. 707–723.
- Y. Liu, W. Lee, G. Tao, S. Ma, Y. Aafer, and X. Zhang, “ABS: scanning neural networks for back-doors by artificial brain stimulation,” in CCS, 2019, pp. 1265–1282.
- J. Lu, T. Issaranon, and D. A. Forsyth, “Safetynet: Detecting and rejecting adversarial examples robustly,” in ICCV, 2017, pp. 446–454.
- J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff, “On detecting adversarial perturbations,” in ICLR, 2017, p. 105978.
- S. Gu and L. Rigazio, “Towards deep neural network architectures robust to adversarial examples,” in ICLR workshop, 2015.
- D. Meng and H. Chen, “Magnet: A two-pronged defense against adversarial examples,” in CCS, 2017, pp. 135–147.
- G. Katz, C. W. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer, “Reluplex: An efficient SMT solver for verifying deep neural networks,” in CAV, 2017, pp. 97–117.
- D. Gopinath, G. Katz, C. S. Pasareanu, and C. W. Barrett, “Deepsafe: A data-driven approach for checking adversarial robustness in neural networks,” CoRR, vol. abs/1710.00486, 2017.
- N. Papernot, P. D. McDaniel, X. Wu, S. Jha, and A. Swami, “Distillation as a defense to adversarial perturbations against deep neural networks,” in S&P, 2016, pp. 582–597.
- G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” CoRR, vol. abs/1503.02531, 2015.
- R. Huang, B. Xu, D. Schuurmans, and C. Szepesvári, “Learning with a strong adversary,” CoRR, vol. abs/1511.03034, 2015.
- OWASP, “Owasp top 10 for llm applications,” https://owasp.org/www-project-top-10-for-large-language-model-applications/assets/PDF/OWASP-Top-10-for-LLMs-2023-v1_0_1.pdf, 2023.
- E. Göktas, E. Athanasopoulos, H. Bos, and G. Portokalidis, “Out of control: Overcoming control-flow integrity,” in SP, 2014, pp. 575–589.
- N. Carlini, A. Barresi, M. Payer, D. A. Wagner, and T. R. Gross, “Control-flow bending: On the effectiveness of control-flow integrity,” in USENIX Security, 2015, pp. 161–176.
- C. Zhang, T. Wei, Z. Chen, L. Duan, L. Szekeres, S. McCamant, D. Song, and W. Zou, “Practical control flow integrity and randomization for binary executables,” in SP, 2013, pp. 559–573.
- R. T. Gollapudi, G. Yuksek, D. Demicco, M. Cole, G. Kothari, R. Kulkarni, X. Zhang, K. Ghose, A. Prakash, and Z. Umrigar, “Control flow and pointer integrity enforcement in a secure tagged architecture,” in SP, 2023, pp. 2974–2989.
- W. U. Hassan, M. Lemay, N. Aguse, A. Bates, and T. Moyer, “Towards scalable cluster auditing through grammatical inference over provenance graphs,” in NDSS, 2018.
- X. Han, T. F. J. Pasquier, A. Bates, J. Mickens, and M. I. Seltzer, “Unicorn: Runtime provenance-based detector for advanced persistent threats,” in NDSS, 2020.
- Q. Wang, W. U. Hassan, D. Li, K. Jee, X. Yu, K. Zou, J. Rhee, Z. Chen, W. Cheng, C. A. Gunter, and H. Chen, “You are what you do: Hunting stealthy malware via data provenance analysis,” in NDSS, 2020.
- L. Yu, S. Ma, Z. Zhang, G. Tao, X. Zhang, D. Xu, V. E. Urias, H. W. Lin, G. F. Ciocarlie, V. Yegneswaran, and A. Gehani, “Alchemist: Fusing application and audit logs for precise attack provenance without instrumentation,” in NDSS, 2021.
- H. Ding, J. Zhai, D. Deng, and S. Ma, “The case for learned provenance graph storage systems,” in USENIX Security, 2023.
- F. Yang, J. Xu, C. Xiong, Z. Li, and K. Zhang, “PROGRAPHER: an anomaly detection system based on provenance graph embedding,” in USENIX Security, 2023.
- A. Tabiban, H. Zhao, Y. Jarraya, M. Pourzandi, M. Zhang, and L. Wang, “Provtalk: Towards interpretable multi-level provenance analysis in networking functions virtualization (NFV),” in NDSS, 2022.
- A. Bates, D. Tian, K. R. B. Butler, and T. Moyer, “Trustworthy whole-system provenance for the linux kernel,” in USENIX Security, 2015, pp. 319–334.
- S. M. Milajerdi, R. Gjomemo, B. Eshete, R. Sekar, and V. N. Venkatakrishnan, “HOLMES: real-time APT detection through correlation of suspicious information flows,” in SP, 2019, pp. 1137–1152.
- A. Alsaheel, Y. Nan, S. Ma, L. Yu, G. Walkup, Z. B. Celik, X. Zhang, and D. Xu, “ATLAS: A sequence-based learning approach for attack investigation,” in USENIX Security, 2021, pp. 3005–3022.
- K. Mukherjee, J. Wiedemeier, T. Wang, J. Wei, F. Chen, M. Kim, M. Kantarcioglu, and K. Jee, “Evading provenance-based ML detectors with adversarial system actions,” in USENIX Security, 2023, pp. 1199–1216.
- M. A. Inam, Y. Chen, A. Goyal, J. Liu, J. Mink, N. Michael, S. Gaur, A. Bates, and W. U. Hassan, “Sok: History is a vast early warning system: Auditing the provenance of system intrusions,” in SP, 2023, pp. 2620–2638.
- C. Fu, Q. Li, M. Shen, and K. Xu, “Realtime robust malicious traffic detection via frequency domain analysis,” in CCS, 2021, pp. 3431–3446.
- D. Barradas, N. Santos, L. Rodrigues, S. Signorello, F. M. V. Ramos, and A. Madeira, “Flowlens: Enabling efficient flow classification for ml-based network security applications,” in NDSS, 2021.
- G. Zhou, Z. Liu, C. Fu, Q. Li, and K. Xu, “An efficient design of intelligent network data plane,” in USENIX Security, 2023.
- S. Panda et al., “Smartwatch: accurate traffic analysis and flow-state tracking for intrusion prevention using smartnics,” in CoNEXT, 2021, pp. 60–75.
- G. Siracusano et al., “Re-architecting traffic analysis with neural network interface cards,” in NSDI, 2022, pp. 513–533.
- Y. Mirsky, T. Doitshman, Y. Elovici, and A. Shabtai, “Kitsune: An ensemble of autoencoders for online network intrusion detection,” in NDSS, 2018.
- J. Holland, P. Schmitt, N. Feamster, and P. Mittal, “New directions in automated traffic analysis,” in CCS, 2021, pp. 3366–3383.
- C. Fu, Q. Li, and K. Xu, “Detecting unknown encrypted malicious traffic in real time via flow interaction graph analysis,” in NDSS, 2023.
- M. Tran et al., “On the feasibility of rerouting-based ddos defenses,” in SP, 2019, pp. 1169–1184.
- D. Wagner et al., “United we stand: Collaborative detection and mitigation of amplification ddos attacks at scale,” in CCS, 2021, pp. 970–987.
- M. Wichtlhuber et al., “IXP scrubber: learning from blackholing traffic for ml-driven ddos detection at scale,” in SIGCOMM, 2022, pp. 707–722.
- VirusTotal, “Virustotal,” https://www.virustotal.com/gui/home/upload, 2023.
- S. Thirumuruganathan, M. Nabeel, E. Choo, I. Khalil, and T. Yu, “Siraj: a unified framework for aggregation of malicious entity detectors,” in SP, 2022, pp. 507–521.
- T. Scholte, W. Robertson, D. Balzarotti, and E. Kirda, “Preventing input validation vulnerabilities in web applications through automated type analysis,” in CSA, 2012, pp. 233–243.
- A. Blankstein and M. J. Freedman, “Automating isolation and least privilege in web services,” in SP, 2014, pp. 133–148.
- D. Sánchez, M. Batet, and A. Viejo, “Automatic general-purpose sanitization of textual documents,” IEEE Transactions on Information Forensics and Security, vol. 8, no. 6, pp. 853–862, 2013.
- Y. Guo, J. Liu, W. Tang, and C. Huang, “Exsense: Extract sensitive information from unstructured data,” Computers & Security, vol. 102, p. 102156, 2021.
- F. Hassan, D. Sánchez, J. Soria-Comas, and J. Domingo-Ferrer, “Automatic anonymization of textual documents: detecting sensitive information via word embeddings,” in TrustCom/BigDataSE, 2019, pp. 358–365.
- W3C, “Ethical principles for web machine learning,” W3C Group Draft Note, https://www.w3.org/TR/webmachinelearning-ethics, 2023.
- Guardrails AI, “Guardrails AI,” https://www.guardrailsai.com/docs/, 2023.
- Laiyer.ai, “Llm guard - the security toolkit for llm interactions,” https://github.com/laiyer-ai/llm-guard/, 2023.
- Azure, “Content filtering,” https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter?tabs=warning%2Cpython, 2023.
- K. Gémes and G. Recski, “Tuw-inf at germeval2021: Rule-based and hybrid methods for detecting toxic, engaging, and fact-claiming comments,” in GermEval KONVENS, 2021, pp. 69–75.
- K. Gémes, Á. Kovács, and G. Recski, “Offensive text detection across languages and datasets using rule-based and hybrid methods,” in CIKM workshop, 2022.
- P. Nakov, V. Nayak, K. Dent, A. Bhatawdekar, S. M. Sarwar, M. Hardalov, Y. Dinkov, D. Zlatkova, G. Bouchard, and I. Augenstein, “Detecting abusive language on online platforms: A critical analysis,” CoRR, vol. abs/2103.00153, 2021.
- F. Alam, S. Cresci, T. Chakraborty, F. Silvestri, D. Dimitrov, G. D. S. Martino, S. Shaar, H. Firooz, and P. Nakov, “A survey on multimodal disinformation detection,” CoRR, vol. abs/2103.12541, 2021.
- P. Nakov, H. T. Sencar, J. An, and H. Kwak, “A survey on predicting the factuality and the bias of news media,” CoRR, vol. abs/2103.12506, 2021.
- T. Hartvigsen, S. Gabriel, H. Palangi, M. Sap, D. Ray, and E. Kamar, “Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection,” CoRR, vol. abs/2203.09509, 2022.
- L. Weng, V. Goel, and A. Vallone, “Using GPT-4 for content moderation,” https://searchengineland.com/openai-ai-classifier-no-longer-available-429912/, 2023.
- Meta AI, “Llama 2 responsible use guide,” https://ai.meta.com/llama/responsible-use-guide/, 2023.
- J. Chen, G. Kim, A. Sriram, G. Durrett, and E. Choi, “Complex claim verification with evidence retrieved in the wild,” CoRR, vol. abs/2305.11859, 2023.
- B. A. Galitsky, “Truth-o-meter: Collaborating with llm in fighting its hallucinations,” 2023.
- S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. W. Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi, “Factscore: Fine-grained atomic evaluation of factual precision in long form text generation,” CoRR, vol. abs/2305.14251, 2023.
- F. Nan, R. Nallapati, Z. Wang, C. N. d. Santos, H. Zhu, D. Zhang, K. McKeown, and B. Xiang, “Entity-level factual consistency of abstractive text summarization,” CoRR, vol. abs/2102.09130, 2021.
- J. Maynez, S. Narayan, B. Bohnet, and R. McDonald, “On faithfulness and factuality in abstractive summarization,” CoRR, vol. abs/2005.00661, 2020.
- A. Agrawal, L. Mackey, and A. T. Kalai, “Do language models know when they’re hallucinating references?” CoRR, vol. abs/2305.18248, 2023.
- R. Cohen, M. Hamri, M. Geva, and A. Globerson, “Lm vs lm: Detecting factual errors via cross examination,” CoRR, vol. abs/2305.13281, 2023.
- T. Scialom, P.-A. Dray, P. Gallinari, S. Lamprier, B. Piwowarski, J. Staiano, and A. Wang, “Questeval: Summarization asks for fact-based evaluation,” CoRR, vol. abs/2103.12693, 2021.
- O. Honovich, L. Choshen, R. Aharoni, E. Neeman, I. Szpektor, and O. Abend, “Q²: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering,” CoRR, vol. abs/2104.08202, 2021.
- A. R. Fabbri, C.-S. Wu, W. Liu, and C. Xiong, “Qafacteval: Improved qa-based factual consistency evaluation for summarization,” CoRR, vol. abs/2112.08542, 2021.
- Z. Guo, M. Schlichtkrull, and A. Vlachos, “A survey on automated fact-checking,” Transactions of the Association for Computational Linguistics, vol. 10, pp. 178–206, 2022.
- R. Zhao, X. Li, S. Joty, C. Qin, and L. Bing, “Verify-and-edit: A knowledge-enhanced chain-of-thought framework,” CoRR, vol. abs/2305.03268, 2023.
- L. Gao, Z. Dai, P. Pasupat, A. Chen, A. T. Chaganty, Y. Fan, V. Zhao, N. Lao, H. Lee, D.-C. Juan et al., “Rarr: Researching and revising what language models say, using language models,” in ACL, 2023, pp. 16 477–16 508.
- Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, N. Duan, and W. Chen, “Critic: Large language models can self-correct with tool-interactive critiquing,” CoRR, vol. abs/2305.11738, 2023.
- X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” CoRR, vol. abs/2203.11171, 2022.
- R. Tang, Y.-N. Chuang, and X. Hu, “The science of detecting llm-generated texts,” CoRR, vol. abs/2303.07205, 2023.
- J. Kirchenbauer, J. Geiping, Y. Wen, J. Katz, I. Miers, and T. Goldstein, “A watermark for large language models,” CoRR, vol. abs/2301.10226, 2023.
- J. Fang, Z. Tan, and X. Shi, “Cosywa: Enhancing semantic integrity in watermarking natural language generation,” in NLPCC, 2023, pp. 708–720.
- M. J. Atallah, V. Raskin, M. Crogan, C. Hempelmann, F. Kerschbaum, D. Mohamed, and S. Naik, “Natural language watermarking: Design, analysis, and a proof-of-concept implementation,” in Information Hiding, 2001, pp. 185–200.
- Z. Jalil and A. M. Mirza, “A review of digital watermarking techniques for text documents,” in ICIMT, 2009, pp. 230–234.
- U. Topkara, M. Topkara, and M. J. Atallah, “The hiding virtues of ambiguity: quantifiably resilient watermarking of natural language text through synonym substitutions,” in MM&Sec, 2006, pp. 164–174.
- J. T. Brassil, S. Low, N. F. Maxemchuk, and L. O’Gorman, “Electronic marking and identification techniques to discourage document copying,” IEEE Journal on Selected Areas in Communications, vol. 13, no. 8, pp. 1495–1504, 1995.
- S. Abdelnabi and M. Fritz, “Adversarial watermarking transformer: Towards tracing text provenance with data hiding,” in S&P, 2021, pp. 121–140.
- V. S. Sadasivan, A. Kumar, S. Balasubramanian, W. Wang, and S. Feizi, “Can ai-generated text be reliably detected?” CoRR, vol. abs/2303.11156, 2023.
- G. Li, Y. Chen, J. Zhang, J. Li, S. Guo, and T. Zhang, “Warfare: Breaking the watermark protection of AI-generated content,” CoRR, vol. abs/2310.07726, 2023.
- B. Huang, B. Zhu, H. Zhu, J. D. Lee, J. Jiao, and M. I. Jordan, “Towards optimal statistical watermarking,” CoRR, vol. abs/2312.07930, 2023.
- C. Chen, Y. Li, Z. Wu, M. Xu, R. Wang, and Z. Zheng, “Towards reliable utilization of AIGC: blockchain-empowered ownership verification mechanism,” IEEE Open J. Comput. Soc., vol. 4, pp. 326–337, 2023.
- A. Chakraborty, M. Alam, V. Dey, A. Chattopadhyay, and D. Mukhopadhyay, “A survey on adversarial attacks and defences,” CAAI Trans. Intell. Technol., vol. 6, no. 1, pp. 25–45, 2021.
- K. Zhu, J. Wang, J. Zhou, Z. Wang, H. Chen, Y. Wang, L. Yang, W. Ye, N. Z. Gong, Y. Zhang et al., “Promptbench: Towards evaluating the robustness of large language models on adversarial prompts,” CoRR, vol. abs/2306.04528, 2023.
- B. Wang, C. Xu, S. Wang, Z. Gan, Y. Cheng, J. Gao, A. H. Awadallah, and B. Li, “Adversarial glue: A multi-task benchmark for robustness evaluation of language models,” CoRR, vol. abs/2111.02840, 2021.
- Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, and D. Kiela, “Adversarial nli: A new benchmark for natural language understanding,” CoRR, vol. abs/1910.14599, 2019.
- L. Yang, S. Zhang, L. Qin, Y. Li, Y. Wang, H. Liu, J. Wang, X. Xie, and Y. Zhang, “Glue-x: Evaluating natural language understanding models from an out-of-distribution generalization perspective,” CoRR, vol. abs/2211.08073, 2022.
- L. Yuan, Y. Chen, G. Cui, H. Gao, F. Zou, X. Cheng, H. Ji, Z. Liu, and M. Sun, “Revisiting out-of-distribution robustness in nlp: Benchmark, analysis, and llms evaluations,” CoRR, vol. abs/2306.04618, 2023.
- N. Vaghani and M. Thummar, “Flipkart product reviews with sentiment dataset,” https://www.kaggle.com/dsv/4940809, 2023.
- T. Liu, Y. Zhang, C. Brockett, Y. Mao, Z. Sui, W. Chen, and B. Dolan, “A token-level reference-free hallucination detection benchmark for free-form text generation,” CoRR, vol. abs/2104.08704, 2021.
- P. Manakul, A. Liusie, and M. J. F. Gales, “Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models,” in EMNLP, 2023, pp. 9004–9017.
- L. K. Umapathi, A. Pal, and M. Sankarasubbu, “Med-halt: Medical domain hallucination test for large language models,” CoRR, vol. abs/2307.15343, 2023.
- J. Li, X. Cheng, W. X. Zhao, J.-Y. Nie, and J.-R. Wen, “Halueval: A large-scale hallucination evaluation benchmark for large language models,” in EMNLP, 2023, pp. 6449–6464.
- J. Luo, C. Xiao, and F. Ma, “Zero-resource hallucination prevention for large language models,” CoRR, vol. abs/2309.02654, 2023.
- S. Casper, J. Lin, J. Kwon, G. Culp, and D. Hadfield-Menell, “Explore, establish, exploit: Red teaming language models from scratch,” CoRR, vol. abs/2306.09442, 2023.
- B. Mathew, P. Saha, S. M. Yimam, C. Biemann, P. Goyal, and A. Mukherjee, “Hatexplain: A benchmark dataset for explainable hate speech detection,” in AAAI, 2021, pp. 14 867–14 875.
- Y. Huang, Q. Zhang, L. Sun et al., “Trustgpt: A benchmark for trustworthy and responsible large language models,” CoRR, vol. abs/2306.11507, 2023.
- J. Deng, J. Zhou, H. Sun, C. Zheng, F. Mi, H. Meng, and M. Huang, “Cold: A benchmark for chinese offensive language detection,” CoRR, vol. abs/2201.06025, 2022.
- G. Xu, J. Liu, M. Yan, H. Xu, J. Si, Z. Zhou, P. Yi, X. Gao, J. Sang, R. Zhang et al., “Cvalues: Measuring the values of chinese large language models from safety to responsibility,” CoRR, vol. abs/2307.09705, 2023.
- J. Zhang, K. Bao, Y. Zhang, W. Wang, F. Feng, and X. He, “Is chatgpt fair for recommendation? evaluating fairness in large language model recommendation,” CoRR, vol. abs/2305.07609, 2023.
- J. Dhamala, T. Sun, V. Kumar, S. Krishna, Y. Pruksachatkun, K.-W. Chang, and R. Gupta, “Bold: Dataset and metrics for measuring biases in open-ended language generation,” in FAccT, 2021, pp. 862–872.
- E. M. Smith, M. Hall, M. Kambadur, E. Presani, and A. Williams, ““i’m sorry to hear that”: Finding new biases in language models with a holistic descriptor dataset,” in EMNLP, 2022, pp. 9180–9211.
- J. Zhou, J. Deng, F. Mi, Y. Li, Y. Wang, M. Huang, X. Jiang, Q. Liu, and H. Meng, “Towards identifying social bias in dialog systems: Frame, datasets, and benchmarks,” CoRR, vol. abs/2202.08011, 2022.
- A. P. Parikh, X. Wang, S. Gehrmann, M. Faruqui, B. Dhingra, D. Yang, and D. Das, “Totto: A controlled table-to-text generation dataset,” CoRR, vol. abs/2004.14373, 2020.
- E. Durmus, H. He, and M. Diab, “Feqa: A question answering evaluation framework for faithfulness assessment in abstractive summarization,” CoRR, vol. abs/2005.03754, 2020.
- B. Dhingra, M. Faruqui, A. Parikh, M.-W. Chang, D. Das, and W. W. Cohen, “Handling divergent reference texts when evaluating table-to-text generation,” CoRR, vol. abs/1906.01081, 2019.
- B. Goodrich, V. Rao, P. J. Liu, and M. Saleh, “Assessing the factual accuracy of generated text,” in SIGKDD, 2019, pp. 166–175.
- T. Falke, L. F. Ribeiro, P. A. Utama, I. Dagan, and I. Gurevych, “Ranking generated summaries by correctness: An interesting but challenging application for natural language inference,” in ACL, 2019, pp. 2214–2220.
- J. Pfeiffer, F. Piccinno, M. Nicosia, X. Wang, M. Reid, and S. Ruder, “mmt5: Modular multilingual pre-training solves source language hallucinations,” CoRR, vol. abs/2305.14224, 2023.
- K. Filippova, “Controlled hallucinations: Learning to generate faithfully from noisy data,” CoRR, vol. abs/2010.05873, 2020.
- F. Nie, J.-G. Yao, J. Wang, R. Pan, and C.-Y. Lin, “A simple recipe towards reducing hallucination in neural surface realisation,” in ACL, 2019, pp. 2673–2679.
- Y. Wang, Y. Zhao, and L. Petzold, “Are large language models ready for healthcare? a comparative study on clinical language understanding,” CoRR, vol. abs/2304.05368, 2023.
- OpenAI, “OpenAI Privacy Policy,” https://openai.com/policies/privacy-policy, 2023.
- S. A. Khowaja, P. Khuwaja, and K. Dev, “Chatgpt needs spade (sustainability, privacy, digital divide, and ethics) evaluation: A review,” CoRR, vol. abs/2305.03123, 2023.
- L. Reynolds and K. McDonell, “Prompt programming for large language models: Beyond the few-shot paradigm,” in CHI Extended Abstracts, 2021, pp. 1–7.
- H. Brown, K. Lee, F. Mireshghallah, R. Shokri, and F. Tramèr, “What does it mean for a language model to preserve privacy?” in FAccT, 2022, pp. 2280–2292.
- X. Li, Y. Li, L. Liu, L. Bing, and S. Joty, “Is gpt-3 a psychopath? evaluating large language models from a psychological perspective,” CoRR, vol. abs/2212.10529, 2022.
- J. Rutinowski, S. Franke, J. Endendyk, I. Dormuth, and M. Pauly, “The self-perception and political biases of chatgpt,” CoRR, vol. abs/2304.07333, 2023.
- M. Das, S. K. Pandey, and A. Mukherjee, “Evaluating chatgpt’s performance for multilingual and emoji-based hate speech detection,” CoRR, vol. abs/2305.13276, 2023.
- D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, and J. Steinhardt, “Aligning ai with shared human values,” CoRR, vol. abs/2008.02275, 2020.
- F. Huang, H. Kwak, and J. An, “Is chatgpt better than human annotators? potential and limitations of chatgpt in explaining implicit hate speech,” CoRR, vol. abs/2302.07736, 2023.
- E. Sheng, K.-W. Chang, P. Natarajan, and N. Peng, “Societal biases in language generation: Progress and challenges,” CoRR, vol. abs/2105.04054, 2021.
- M. Nadeem, A. Bethke, and S. Reddy, “Stereoset: Measuring stereotypical bias in pretrained language models,” CoRR, vol. abs/2004.09456, 2020.
- J. Hartmann, J. Schwenzow, and M. Witte, “The political ideology of conversational ai: Converging evidence on chatgpt’s pro-environmental, left-libertarian orientation,” CoRR, vol. abs/2301.01768, 2023.
- Y. Cao, L. Zhou, S. Lee, L. Cabello, M. Chen, and D. Hershcovich, “Assessing cross-cultural alignment between chatgpt and human societies: An empirical study,” CoRR, vol. abs/2303.17466, 2023.
- A. Ramezani and Y. Xu, “Knowledge of cultural moral norms in large language models,” CoRR, vol. abs/2306.01857, 2023.
- Y. Wan, W. Wang, P. He, J. Gu, H. Bai, and M. Lyu, “Biasasker: Measuring the bias in conversational ai system,” CoRR, vol. abs/2305.12434, 2023.
- Q. Luo, M. J. Puett, and M. D. Smith, “A perspectival mirror of the elephant: Investigating language bias on google, chatgpt, wikipedia, and youtube,” CoRR, vol. abs/2303.16281, 2023.
- Y. Tian, X. Yang, J. Zhang, Y. Dong, and H. Su, “Evil geniuses: Delving into the safety of llm-based agents,” CoRR, vol. abs/2311.11855, 2023.
Authors: Tianyu Cui, Yanling Wang, Chuanpu Fu, Yong Xiao, Sijia Li, Xinhao Deng, Yunpeng Liu, Qinglin Zhang, Ziyi Qiu, Peiyang Li, Zhixing Tan, Junwu Xiong, Xinyu Kong, Zujie Wen, Ke Xu, and Qi Li.