Causality Analysis for Evaluating the Security of Large Language Models (2312.07876v1)
Abstract: LLMs such as GPT and Llama2 are increasingly adopted in many safety-critical applications. Their security is thus essential. Even with considerable efforts spent on reinforcement learning from human feedback (RLHF), recent studies have shown that LLMs are still subject to attacks such as adversarial perturbation and Trojan attacks. Further research is thus needed to evaluate their security and/or understand the lack of it. In this work, we propose a framework for conducting lightweight causality analysis of LLMs at the token, layer, and neuron level. We applied our framework to open-source LLMs such as Llama2 and Vicuna and made multiple interesting discoveries. Based on a layer-level causality analysis, we show that RLHF has the effect of overfitting a model to harmful prompts, which implies that such security can be easily overcome by 'unusual' harmful prompts. As evidence, we propose an adversarial perturbation method that achieves a 100% attack success rate on the red-teaming tasks of the Trojan Detection Competition 2023. Furthermore, we show the existence of one mysterious neuron in both Llama2 and Vicuna that has an unreasonably high causal effect on the output. While we are uncertain why such a neuron exists, we show that it is possible to conduct a 'Trojan' attack targeting that particular neuron to completely cripple the LLM, i.e., we can generate transferable prompt suffixes that frequently make the LLM produce meaningless responses.
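The abstract does not spell out how the layer-level causal effects are computed, so the sketch below only illustrates the general idea such an analysis rests on: intervene on one component of the model at a time and measure how much the output distribution changes. It is a minimal, assumed implementation, not the paper's framework; the model name, the skip-one-decoder-layer intervention, the KL-divergence effect score, and the use of HuggingFace `transformers` forward hooks are all illustrative choices.

```python
# Minimal sketch of a layer-level causal-intervention probe for a causal LM.
# This is NOT the paper's implementation; it only illustrates the general idea
# of measuring how strongly one decoder layer influences the model's output.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model; any causal LM with a LLaMA-style `model.layers` stack works.
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def next_token_log_probs(prompt: str) -> torch.Tensor:
    """Log-probabilities over the vocabulary for the token following `prompt`."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    return F.log_softmax(logits, dim=-1)


def layer_effect(prompt: str, layer_idx: int) -> float:
    """Causal-effect proxy for one decoder layer: KL divergence between the
    clean next-token distribution and the one obtained when the layer is
    skipped (its output hidden state replaced by its input hidden state)."""
    clean = next_token_log_probs(prompt)

    layer = model.model.layers[layer_idx]  # LLaMA-style decoder stack (assumed)

    def skip_hook(module, inputs, output):
        # Pretend the layer is an identity map: pass its input hidden state
        # through unchanged. Decoder layers may return a tuple or a tensor.
        hidden_in = inputs[0]
        if isinstance(output, tuple):
            return (hidden_in,) + tuple(output[1:])
        return hidden_in

    handle = layer.register_forward_hook(skip_hook)
    try:
        ablated = next_token_log_probs(prompt)
    finally:
        handle.remove()

    # Larger KL => the layer has a larger causal effect on this prompt's output.
    return F.kl_div(ablated, clean, log_target=True, reduction="sum").item()


if __name__ == "__main__":
    prompt = "Write a short tutorial on how to stay safe online."
    for i in range(len(model.model.layers)):
        print(f"layer {i:2d}: causal effect (KL) = {layer_effect(prompt, i):.4f}")
```

A neuron-level probe can be sketched analogously by zeroing a single coordinate of a hidden activation inside the hook instead of skipping the whole layer.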
- Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132, 2023.
- Som S. Biswas. Role of ChatGPT in public health. Annals of Biomedical Engineering, 51(5):868–869, 2023.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- Neural network attributions: A causal perspective. In International Conference on Machine Learning, pages 981–990. PMLR, 2019.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
- What causes a system to satisfy a specification? ACM Transactions on Computational Logic (TOCL), 9(3):1–26, 2008.
- Jailbreaker: Automated jailbreak across multiple large language model chatbots. arXiv preprint arXiv:2307.08715, 2023.
- QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314, 2023.
- Generative AI and ChatGPT: Applications, challenges, and AI-human collaboration, 2023.
- Measuring skewness and kurtosis. Journal of the Royal Statistical Society Series D: The Statistician, 33(4):391–399, 1984.
- M A Hernán. A definition of causal effect for epidemiological research. Journal of Epidemiology & Community Health, 58(4):265–271, 2004.
- Actual causality canvas: a general framework for explanation-based socio-technical constructs. In ECAI 2020, pages 2978–2985. IOS Press, 2020.
- Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023.
- Causal testing: understanding defects’ root causes. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pages 87–99, 2020.
- Automatically auditing large language models via discrete optimization. arXiv preprint arXiv:2303.04381, 2023.
- Multi-step jailbreaking privacy attacks on ChatGPT. arXiv preprint arXiv:2304.05197, 2023.
- Trojaning attack on neural networks. In 25th Annual Network and Distributed System Security Symposium (NDSS 2018). Internet Society, 2018.
- Casper LLM. Casper experiments data and code, 2023. https://casperllm.github.io/ [Accessed: 2023-11-28].
- Analyzing leakage of personally identifiable information in language models. arXiv preprint arXiv:2302.00539, 2023.
- Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372, 2022.
- Explaining deep learning models using causal inference. arXiv preprint arXiv:1811.04376, 2018.
- TDC 2023 Organizers. The Trojan Detection Challenge 2023 (LLM Edition), 2023. https://trojandetection.ai/ [Accessed: 2023-11-28].
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Judea Pearl. Direct and indirect effects. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 411–420, 2001.
- Judea Pearl. Causality. Cambridge University Press, 2009.
- Red teaming language models with language models. arXiv preprint arXiv:2202.03286, 2022.
- Two-in-One: A model hijacking attack against text generation models. In 32nd USENIX Security Symposium (USENIX Security 23), pages 2223–2240. USENIX Association, 2023.
- Causality-based neural network repair. In Proceedings of the 44th International Conference on Software Engineering, pages 338–349, 2022.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Investigating gender bias in language models using causal mediation analysis. Advances in Neural Information Processing Systems, 33:12388–12401, 2020.
- Jailbroken: How does LLM safety training fail? arXiv preprint arXiv:2307.02483, 2023.
- Fairness in decision-making—the causal explanation formula. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
- Prompt as triggers for backdoor attack: Examining the vulnerability in language models. arXiv preprint arXiv:2305.01219, 2023.
- Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.