Exploiting Novel GPT-4 APIs (2312.14302v2)
Abstract: LLM attacks typically assume one of two extreme threat models: full white-box access to model weights, or black-box access limited to a text generation API. However, real-world APIs are often more flexible than just text generation: these APIs expose "gray-box" access leading to new threat vectors. To explore this, we red-team three new functionalities exposed in the GPT-4 APIs: fine-tuning, function calling and knowledge retrieval. We find that fine-tuning a model on as few as 15 harmful examples or 100 benign examples can remove core safeguards from GPT-4, enabling a range of harmful outputs. Furthermore, we find that GPT-4 Assistants readily divulge the function call schema and can be made to execute arbitrary function calls. Finally, we find that knowledge retrieval can be hijacked by injecting instructions into retrieval documents. These vulnerabilities highlight that any additions to the functionality exposed by an API can create new vulnerabilities.
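For context on the fine-tuning surface the abstract describes, the sketch below shows the generic shape of a fine-tuning request through the OpenAI Python SDK: upload a chat-format JSONL file, then create a job that returns a customized model served through the same chat endpoint. The file name `examples.jsonl` and the model identifier are illustrative placeholders, not details taken from the paper; this is a minimal sketch of the API mechanism, not the paper's attack data or procedure.

```python
# Minimal sketch of the gray-box fine-tuning surface discussed above,
# using the OpenAI Python SDK (openai>=1.0). Dataset contents, file name,
# and model identifier are placeholders for illustration only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a JSONL file of chat-formatted training examples.
training_file = client.files.create(
    file=open("examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch the fine-tuning job; the resulting custom model is later served
# through the same chat-completions endpoint as the base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",  # illustrative; the paper targets GPT-4 fine-tuning access
)
print(job.id, job.status)
```

The relevant point for the threat model is that this interface accepts arbitrary user-supplied training examples, which is how a small number of examples can shift the fine-tuned model's behavior away from the base model's safeguards.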