TrojFM: Resource-efficient Backdoor Attacks against Very Large Foundation Models (2405.16783v1)
Abstract: One key challenge in backdoor attacks against large foundation models is the resource limits. Backdoor attacks usually require retraining the target model, which is impractical for very large foundation models. Existing backdoor attacks are mainly designed for supervised classifiers or small foundation models (e.g., BERT). None of these attacks has successfully compromised a very large foundation model, such as Llama-3-70B, especially with limited computational resources. In this paper, we propose TrojFM, a novel backdoor attack tailored for very large foundation models. Our primary technical contribution is the development of a novel backdoor injection method. This method forces a backdoored model to generate similar hidden representations for poisoned inputs regardless of their actual semantics. Our approach injects such backdoors by fine-tuning only a very small proportion of model parameters. This enables TrojFM to efficiently launch downstream task-agnostic backdoor attacks against very large foundation models under limited computational resources. Moreover, we optimize the fine-tuning process with our customized QLoRA technique, enabling launching our attack via only~\textit{one A100 GPU}. Furthermore, we design a new trigger injection method to ensure our attack stealthiness. Through extensive experiments, we first demonstrate that TrojFM can launch effective backdoor attacks against widely used large GPT-style models without jeopardizing their normal functionalities (and outperforming existing attacks on BERT-style models). Furthermore, we show that TrojFM is resilient to SOTA defenses and is insensitive to changes in key hyper-parameters. Finally, we conduct a resource analysis to quantify that our method can significantly save computational and memory costs compared to existing backdoor attacks.
- Understanding intermediate layers using linear classifier probes. In ICLR, 2017.
- {{\{{T-Miner}}\}}: A generative approach to defend against trojan attacks on {{\{{DNN-based}}\}} text classification. In USENIX Security, 2021.
- How to backdoor federated learning. In AISTAT, 2020.
- Qwen technical report, 2023.
- Sequential modeling enables scalable learning for large vision models. arXiv preprint arXiv:2312.00785, 2023.
- Cleanclip: Mitigating data poisoning attacks in multimodal contrastive learning. arXiv preprint arXiv:2303.03323, 2023.
- Language models are few-shot learners. In NeurIPS, 2020.
- Badprompt: Backdoor attacks on continuous prompts. In NeurIPS, 2022.
- Poisoning and backdooring contrastive learning. arXiv preprint arXiv:2106.09667, 2021.
- Extracting training data from large language models. In USENIX Security 21, 2021.
- Mitigating backdoor attacks in lstm-based text classification systems by backdoor keyword identification. Neurocomputing, 2021.
- Clean-image backdoor: Attacking multi-label models with poisoned labels only. In ICLR, 2023.
- Badpre: Task-agnostic backdoor attacks to pre-trained nlp foundation models. In ICLR, 2022.
- Kallima: A clean-label framework for textual backdoor attacks. In ESORICS, 2022.
- Badnl: Backdoor attacks against nlp models with semantic-preserving improvements. In ACSAC, 2021.
- Apple of sodom: Hidden backdoors in superior sentence embeddings via contrastive learning. arXiv preprint arXiv:2210.11082, 2022.
- A backdoor attack against lstm-based text classification systems. IEEE Access, 2019.
- Qlora: Efficient finetuning of quantized llms. In NeurIPS, 2023.
- Bert: Pre-training of deep bidirectional transformers for language understanding. In NeurIPS, 2019.
- Euclidean distance matrices: Essential theory, algorithms, and applications. IEEE Signal Processing Magazine, 2015.
- Robust anomaly detection and backdoor attack detection via differential privacy. arXiv preprint arXiv:1911.07116, 2019.
- Uor: Universal backdoor attacks on pre-trained language models. arXiv preprint arXiv:2305.09574, 2023.
- A framework for few-shot language model evaluation, 12 2023.
- Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate, 2022.
- netfound: Foundation model for network security. arXiv preprint arXiv:2310.17025, 2023.
- Few-shot backdoor attacks via neural tangent kernels. In ICLR, 2023.
- Long short-term memory. Neural computation, 1997.
- Handcrafted backdoors in deep neural networks. In NeurIPS, 2022.
- Training-free lexical backdoor attacks on language models. In WWW, 2023.
- Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
- Backdoor attacks for in-context learning with language models. arXiv preprint arXiv:2307.14692, 2023.
- Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- Adam: A method for stochastic optimization. In ICLR, 2014.
- Trojdrl: evaluation of backdoor attacks on deep reinforcement learning. In DAC, 2020.
- Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012.
- Weight poisoning attacks on pre-trained models. In ACL, 2020.
- Efficient memory management for large language model serving with pagedattention. In SOSP, 2023.
- Chatgpt as an attack tool: Stealthy textual backdoor attack via blackbox generative model trigger. arXiv preprint arXiv:2304.14475, 2023.
- Backdoor attacks on pre-trained models by layerwise weight poisoning. arXiv preprint arXiv:2108.13888, 2021.
- Pytorch distributed: Experiences on accelerating data parallel training. VLDB, 2020.
- Badedit: Backdooring large language models by model editing. In ICLR, 2024.
- Truthfulqa: Measuring how models mimic human falsehoods. In ACL, 2021.
- Fine-pruning: Defending against backdooring attacks on deep neural networks. In RAID, 2018.
- Piccolo: Exposing complex backdoors in nlp transformer models. In S&P, 2022.
- Unified-io: A unified model for vision, language, and multi-modal tasks. In ICLR, 2023.
- Dbia: Data-free backdoor injection attack against transformer networks. arXiv preprint arXiv:2111.11870, 2021.
- A data-free backdoor injection approach in neural networks. In USENIX Security, 2023.
- Learning word vectors for sentiment analysis. In ACL, 2011.
- Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
- Meta. Introducing meta llama 3: The most capable openly available llm to date. https://ai.meta.com/blog/meta-llama-3/, 2024.
- Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2024.
- Bleu: A method for automatic evaluation of machine translation. In ACL, 2002.
- Textguard: Provable defense against backdoor attacks on text classification. In NDSS, 2023.
- Xda: Accurate, robust disassembly with transfer learning. arXiv preprint arXiv:2010.00770, 2020.
- Onion: A simple and effective defense against textual backdoor attacks. In EMNLP, 2021.
- Mind the style of text! adversarial and backdoor attacks based on text style transfer. In EMNLP, 2021.
- Hidden killer: Invisible textual backdoor attacks with syntactic trigger. In ACL-IJCNLP, 2021.
- Turn the combination lock: Learnable textual backdoor attacks via word substitution. In ACL, 2021.
- Fine-tuning aligned language models compromises safety, even when users do not intend to! In ICLR, 2024.
- Zero: Memory optimizations toward training trillion parameter models. In HPC, 2020.
- Know what you don’t know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822, 2018.
- Squad: 100,000+ questions for machine comprehension of text. In ACL, 2016.
- Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In KDD, 2020.
- Backdoor attacks on self-supervised learning. In CVPR, 2022.
- Universal backdoor attacks. In ICLR, 2024.
- Constrained optimization with dynamic bound-scaling for effective nlp backdoor defense. In ICML, 2022.
- Backdoor pre-trained models can transfer to all. In CCS, 2021.
- Badgpt: Exploring security vulnerabilities of chatgpt via backdoor attacks to instructgpt. arXiv preprint arXiv:2304.12298, 2023.
- On the exploitability of instruction tuning. In NeurIPS, 2023.
- Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.
- Sleeper agent: Scalable hidden trigger backdoors for neural networks trained from scratch. Advances in Neural Information Processing Systems, 35:19165–19178, 2022.
- Certified defenses for data poisoning attacks. In NeurIPS, 2017.
- Introduction to Data Mining (2nd Edition). Pearson, 2nd edition, 2018.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Spectral signatures in backdoor attacks. In NeurIPS, 2018.
- Attention is all you need. In NeurIPS, 2017.
- Poisoning language models during instruction tuning. In ICML, 2023.
- Attack of the tails: Yes, you really can backdoor federated learning. In NeurIPS, 2020.
- On the exploitability of reinforcement learning with human feedback for large language models. arXiv preprint arXiv:2311.09641, 2023.
- Backdoorl: Backdoor attack against competitive reinforcement learning. arXiv preprint arXiv:2105.00579, 2021.
- Punctuation-level attack: Single-shot and single punctuation can fool text models. In NeurIPS, 2023.
- Rab: Provable robustness against backdoor attacks. In S&P, 2023.
- Shared adversarial unlearning: Backdoor mitigation by unlearning shared adversarial examples. arXiv preprint arXiv:2307.10562, 2023.
- Adversarial neuron pruning purifies backdoored deep models. In NeurIPS, 2021.
- Defending pre-trained language models as few-shot learners against backdoor attacks. In NeurIPS, 2023.
- A unified detection framework for inference-stage backdoor defenses. In NeurIPS, 2023.
- Badchain: Backdoor chain-of-thought prompting for large language models. In ICLR, 2024.
- Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models. arXiv preprint arXiv:2305.14710, 2023.
- Trojprompt: A black-box trojan attack on pre-trained language models. In NeurIPS, 2023.
- Parafuzz: An interpretability-driven technique for detecting poisoned samples in nlp. In NeurIPS, 2023.
- Be careful about poisoned word embeddings: Exploring the vulnerability of the embedding layers in nlp models. arXiv preprint arXiv:2103.15543, 2021.
- Rap: Robustness-aware perturbations for defending against backdoor attacks on nlp models. arXiv preprint arXiv:2110.07831, 2021.
- Rethinking stealthiness of backdoor attack against nlp models. In ACL, 2021.
- Data poisoning attacks against multimodal encoders. In ICML, 2023.
- Adversarial unlearning of backdoors via implicit hypergradient. arXiv preprint arXiv:2110.03735, 2021.
- Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
- Character-level convolutional networks for text classification. In NeurIPS, 2015.
- Red alarm for pre-trained models: Universal vulnerability to neuron-level backdoor attacks. Machine Intelligence Research, 2023.
- Prompt as triggers for backdoor attack: Examining the vulnerability in language models. arXiv preprint arXiv:2305.01219, 2023.
- Judging llm-as-a-judge with mt-bench and chatbot arena. In NeurIPS, 2023.
- Moderate-fitting as a natural backdoor defender for pre-trained language models. In NeurIPS, 2022.
- Neural polarizer: A lightweight and effective backdoor defense via purifying poisoned features. In NeurIPS, 2023.
- Yuzhou. Nie (13 papers)
- Michael J. De Lucia (5 papers)
- Nathaniel D. Bastian (34 papers)
- Yanting. Wang (1 paper)
- Jinyuan. Jia (1 paper)
- Wenbo. Guo (1 paper)
- Dawn. Song (1 paper)