
Evaluating and Safeguarding the Adversarial Robustness of Retrieval-Based In-Context Learning (2405.15984v4)

Published 24 May 2024 in cs.CL and cs.AI

Abstract: With the emergence of LLMs, such as LLaMA and OpenAI GPT-3, In-Context Learning (ICL) gained significant attention due to its effectiveness and efficiency. However, ICL is very sensitive to the choice, order, and verbaliser used to encode the demonstrations in the prompt. Retrieval-Augmented ICL methods try to address this problem by leveraging retrievers to extract semantically related examples as demonstrations. While this approach yields more accurate results, its robustness against various types of adversarial attacks, including perturbations on test samples, demonstrations, and retrieved data, remains under-explored. Our study reveals that retrieval-augmented models can enhance robustness against test sample attacks, outperforming vanilla ICL with a 4.87% reduction in Attack Success Rate (ASR); however, they exhibit overconfidence in the demonstrations, leading to a 2% increase in ASR for demonstration attacks. Adversarial training can help improve the robustness of ICL methods to adversarial attacks; however, such a training scheme can be too costly in the context of LLMs. As an alternative, we introduce an effective training-free adversarial defence method, DARD, which enriches the example pool with those attacked samples. We show that DARD yields improvements in performance and robustness, achieving a 15% reduction in ASR over the baselines. Code and data are released to encourage further research: https://github.com/simonucl/adv-retreival-icl
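The two ideas the abstract relies on, Attack Success Rate and enriching the example pool with attacked samples, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the ASR formula, so the code assumes one common definition (the share of originally correct predictions that an attack flips to wrong). The token-overlap retriever, the `perturb` callback, and all function names are hypothetical stand-ins chosen for illustration; the actual method is in the linked repository.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (text, label)


def attack_success_rate(
    labels: Sequence[str],
    clean_preds: Sequence[str],
    adv_preds: Sequence[str],
) -> float:
    """Assumed definition: among examples predicted correctly before the
    attack, the fraction predicted incorrectly after the attack."""
    attacked = [(y, a) for y, c, a in zip(labels, clean_preds, adv_preds) if c == y]
    if not attacked:
        return 0.0
    flipped = sum(1 for y, a in attacked if a != y)
    return flipped / len(attacked)


def token_overlap(a: str, b: str) -> float:
    """Toy Jaccard similarity over whitespace tokens (stand-in for a real retriever)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(query: str, pool: Sequence[Example], k: int) -> List[Example]:
    """Return the k pool examples most similar to the query."""
    return sorted(pool, key=lambda ex: token_overlap(query, ex[0]), reverse=True)[:k]


def enrich_pool(pool: Sequence[Example], perturb: Callable[[str], str]) -> List[Example]:
    """Add a perturbed copy of every example, keeping its original label,
    so that attacked inputs can retrieve similar attacked demonstrations."""
    return list(pool) + [(perturb(text), label) for text, label in pool]


# Usage: a perturbed query retrieves its perturbed counterpart from the enriched pool.
pool = [("a great movie", "pos"), ("a dull plot", "neg")]
enriched = enrich_pool(pool, lambda t: t.replace("great", "gr8"))
top = retrieve("a gr8 movie", enriched, k=1)

asr = attack_success_rate(
    labels=["pos", "neg", "pos", "neg"],
    clean_preds=["pos", "neg", "neg", "neg"],
    adv_preds=["neg", "neg", "neg", "pos"],
)
```

In the usage lines, three of the four examples are correct before the attack and two of those three are flipped, so `asr` is 2/3. No model is retrained at any point; only the retrieval pool changes, which is what makes this kind of defence training-free.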

Authors (4)
  1. Jie He
  2. Pasquale Minervini
  3. Jeff Z. Pan
  4. Simon Yu
