Red Teaming Language Model Detectors with Language Models (2305.19713v2)
Abstract: The prevalence and strong capability of large language models (LLMs) present significant safety and ethical risks if exploited by malicious users. To prevent potentially deceptive uses of LLMs, recent work has proposed algorithms to detect LLM-generated text and protect LLMs. In this paper, we investigate the robustness and reliability of these LLM detectors under adversarial attacks. We study two attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt that alters the writing style of the generation. In both strategies, we leverage an auxiliary LLM to generate the word replacements or the instructional prompt. Unlike previous work, we consider a challenging setting where the auxiliary LLM can also be protected by a detector. Experiments show that our attacks effectively degrade the performance of all detectors in the study while producing plausible generations, underscoring the urgent need to improve the robustness of LLM-generated text detection systems.
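Both strategies reduce to black-box queries against a detector: propose a perturbation (a synonym substitution or an instructional prompt), edit or regenerate the text, and keep the change if the detector's machine-generated score drops. The sketch below illustrates that loop; `detector_score`, `llm_synonyms`, and `generate` are hypothetical interfaces standing in for the detector and the auxiliary LLM, not the paper's released implementation.

```python
# Minimal sketch of the two attack strategies described in the abstract.
# All callables (detector_score, llm_synonyms, generate) are assumed
# interfaces, not the paper's actual code.

def substitution_attack(text, detector_score, llm_synonyms, max_edits=10):
    """Greedily swap words for LLM-proposed synonyms to lower the
    detector's machine-generated score (strategy 1)."""
    words = text.split()
    best_score = detector_score(" ".join(words))
    edits = 0
    for i in range(len(words)):
        if edits >= max_edits:
            break
        # Ask the auxiliary LLM for context-aware synonym candidates.
        for candidate in llm_synonyms(words[i], context=" ".join(words)):
            trial = words[:i] + [candidate] + words[i + 1:]
            score = detector_score(" ".join(trial))
            if score < best_score:  # lower = judged less machine-like
                words, best_score = trial, score
                edits += 1
                break
    return " ".join(words)

def prompt_search(task, candidate_prompts, generate, detector_score):
    """Select the instructional prompt whose generation best evades the
    detector (strategy 2). Candidate prompts could themselves be
    proposed by an auxiliary LLM."""
    return min(
        candidate_prompts,
        key=lambda prompt: detector_score(generate(prompt + "\n" + task)),
    )
```

In the paper's harder setting, the auxiliary LLM is itself protected by a detector, so the word replacements and prompts it produces must also evade detection rather than serving as a free paraphrasing oracle.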
Authors: Zhouxing Shi, Yihan Wang, Fan Yin, Xiangning Chen, Kai-Wei Chang, Cho-Jui Hsieh