Enhancing Jailbreak Attacks with Diversity Guidance (2403.00292v2)
Abstract: As large language models (LLMs) become commonplace in practical applications, their security issues have attracted societal concern. Although extensive efforts have been devoted to safety alignment, LLMs remain vulnerable to jailbreak attacks. We find that redundant computations limit the performance of existing jailbreak attack methods. Therefore, we propose DPP-based Stochastic Trigger Searching (DSTS), a new optimization algorithm for jailbreak attacks. DSTS incorporates diversity guidance through stochastic gradient search and determinantal point process (DPP) selection during optimization. Detailed experiments and ablation studies demonstrate the effectiveness of the algorithm. Moreover, we use the proposed algorithm to compute the risk boundaries of different LLMs, providing a new perspective on LLM safety evaluation.
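To make the DPP-selection step concrete, below is a minimal, illustrative sketch of greedy MAP inference for a determinantal point process used to pick a diverse yet high-scoring subset of candidate triggers. The function name, the quality/similarity kernel construction, and the greedy log-determinant loop are assumptions for illustration, not the paper's exact DSTS procedure.

```python
import numpy as np

def greedy_dpp_select(embeddings, scores, k):
    """Pick k diverse, high-scoring candidates via greedy DPP MAP inference.

    Illustrative sketch only -- not the paper's exact DSTS routine.
    Kernel: L = diag(q) S diag(q), where S is cosine similarity between
    candidate embeddings and q is a per-candidate quality term.
    """
    # Cosine-similarity matrix over L2-normalized candidate embeddings.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    S = X @ X.T
    q = np.exp(scores)                   # quality from (assumed) attack scores
    L = q[:, None] * S * q[None, :]      # DPP kernel trading quality vs. diversity

    selected, remaining = [], list(range(len(scores)))
    for _ in range(k):
        # Greedily add the candidate that most increases log det(L_selected),
        # i.e. the marginal diversity-weighted gain under the DPP.
        gains = {i: np.linalg.slogdet(
                     L[np.ix_(selected + [i], selected + [i])])[1]
                 for i in remaining}
        best = max(gains, key=gains.get)
        selected.append(best)
        remaining.remove(best)
    return selected
```

A practical implementation would compute the log-determinant gains incrementally (e.g., via Cholesky updates) rather than re-evaluating the full determinant at every step, but the greedy log-det objective above captures the core idea of DPP-based diverse subset selection.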