TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification (2402.12991v2)
Abstract: LLM services and models often come with legal terms governing who may use them and how. Verifying that released LLMs are used in compliance with these terms is crucial, as the terms protect the interests of the LLM contributor and prevent misuse. In this context, we describe the novel fingerprinting problem of Black-box Identity Verification (BBIV): determining whether a third-party application uses a specific LLM through its chat function. We propose a method called Targeted Random Adversarial Prompt (TRAP) that identifies the LLM in use. We repurpose adversarial suffixes, originally developed for jailbreaking, to elicit a pre-defined answer from the target LLM, while other models give random answers. TRAP detects the target LLMs with over 95% true positive rate at under 0.2% false positive rate after even a single interaction. TRAP remains effective even if the LLM has minor changes that do not substantially alter its original function.
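To make the BBIV decision rule concrete, below is a minimal sketch in Python. It assumes a hypothetical `query_chat` function standing in for the third-party chat endpoint, and a suffix already optimized (e.g., via GCG-style search) so that the target LLM answers a fixed number; the prompt, target answer, and answer space are illustrative placeholders, not the paper's exact settings.

```python
import re

# Hypothetical stand-in for the third-party application's chat endpoint.
def query_chat(prompt: str) -> str:
    raise NotImplementedError("wire this to the black-box chat API under test")

# Illustrative values: the base instruction asks for a random number, and the
# adversarial suffix is assumed to have been pre-optimized (e.g., GCG-style)
# so that the *target* LLM deterministically answers TARGET_ANSWER.
BASE_PROMPT = "Write a random number between 0 and 999. Answer with the number only."
ADV_SUFFIX = "<pre-optimized adversarial suffix>"  # placeholder, not a real suffix
TARGET_ANSWER = "314"
ANSWER_SPACE = 1000  # size of the random-answer space

def is_target_llm(num_queries: int = 1) -> bool:
    """Return True if the endpoint answers like the fingerprinted target LLM.

    A non-target model answering roughly uniformly matches by chance with
    probability ~1/ANSWER_SPACE per query (0.1% here), so even a single match
    is strong evidence; requiring a match on every query drives the false
    positive rate toward (1/ANSWER_SPACE) ** num_queries.
    """
    for _ in range(num_queries):
        reply = query_chat(BASE_PROMPT + " " + ADV_SUFFIX)
        match = re.search(r"\d+", reply)  # tolerate chatty text around the number
        if match is None or match.group() != TARGET_ANSWER:
            return False
    return True
```

The random-answer prompt is what makes the false positive rate easy to bound: a model that was not fingerprinted has no reason to prefer the target number, so the chance-match probability follows directly from the size of the answer space.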