LimeAttack: Local Explainable Method for Textual Hard-Label Adversarial Attack (2308.00319v2)
Abstract: Natural language processing models are vulnerable to adversarial examples. Previous textual adversarial attacks use gradients or confidence scores to compute a word importance ranking and generate adversarial examples. However, such information is unavailable in real-world settings. We therefore focus on a more realistic and challenging setting, the hard-label attack, in which the attacker can only query the model and obtain a discrete prediction label. Existing hard-label attack algorithms typically initialize adversarial examples by random substitution and then apply complex heuristic algorithms to optimize the adversarial perturbation. These methods require many model queries, and their attack success rate is limited by the adversarial example initialization. In this paper, we propose a novel hard-label attack algorithm named LimeAttack, which leverages a local explainable method to approximate the word importance ranking and then adopts beam search to find the optimal solution. Extensive experiments show that LimeAttack achieves better attack performance than existing hard-label attacks under the same query budget. In addition, we evaluate the effectiveness of LimeAttack on LLMs, and the results indicate that adversarial examples remain a significant threat to LLMs. The adversarial examples crafted by LimeAttack are highly transferable and effectively improve model robustness in adversarial training.
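The abstract describes a two-step procedure: estimate word importance with a local explainable (LIME-style) surrogate that only needs hard-label queries, then beam-search over word substitutions. The following is a minimal sketch of that idea, assuming a `query_label` callable that returns only the predicted class and a `get_synonyms` helper; the kernel width, sample count, beam width, and the random beam scoring are illustrative placeholders, not the paper's actual settings.

```python
# Minimal sketch of the LimeAttack idea (hard-label setting).
# query_label(text) -> predicted class only; get_synonyms(word) -> candidate list.
import random
import numpy as np
from numpy.linalg import lstsq


def lime_word_importance(words, query_label, orig_label, n_samples=100):
    """Rank words by fitting a weighted linear surrogate on binary
    word-presence masks vs. whether the hard label stays unchanged."""
    n = len(words)
    masks, targets, weights = [], [], []
    for _ in range(n_samples):
        mask = np.random.randint(0, 2, size=n)        # 1 = keep word, 0 = drop
        perturbed = [w for w, m in zip(words, mask) if m == 1]
        label = query_label(" ".join(perturbed))      # hard-label query only
        masks.append(mask)
        targets.append(1.0 if label == orig_label else 0.0)
        distance = 1.0 - mask.sum() / n               # fraction of words dropped
        weights.append(np.exp(-(distance ** 2) / 0.25))  # assumed kernel width
    X = np.array(masks, dtype=float)
    y = np.array(targets)
    w = np.sqrt(np.array(weights))
    coef, *_ = lstsq(X * w[:, None], y * w, rcond=None)
    # A large positive coefficient means the word's presence supports the
    # original label, so perturbing it first is more likely to flip the output.
    return np.argsort(-coef)


def beam_search_attack(words, query_label, orig_label, get_synonyms,
                       beam_width=3, max_changes=5):
    """Beam search over synonym substitutions on the highest-ranked words."""
    order = lime_word_importance(words, query_label, orig_label)
    beam = [list(words)]
    for idx in order[:max_changes]:
        candidates = []
        for cand in beam:
            for syn in get_synonyms(words[idx]):
                new = list(cand)
                new[idx] = syn
                if query_label(" ".join(new)) != orig_label:
                    return new                        # adversarial example found
                candidates.append(new)
        # Keep a few candidates; the real method scores beams (e.g. by semantic
        # similarity), which this sketch replaces with random selection.
        beam = random.sample(candidates, min(beam_width, len(candidates))) or beam
    return None
```

The key difference from score-based attacks is that the surrogate is fit on discrete agree/disagree outcomes rather than on confidence scores, which is what makes the importance ranking usable under the hard-label constraint.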
Authors: Hai Zhu, Zhaoqing Yang, Weiwei Shang, Yuren Wu