BadActs: A Universal Backdoor Defense in the Activation Space (2405.11227v1)
Abstract: Backdoor attacks pose an increasingly severe security threat to Deep Neural Networks (DNNs) during their development stage. In response, backdoor sample purification has emerged as a promising defense mechanism, aiming to eliminate backdoor triggers while preserving the integrity of the clean content in the samples. However, existing approaches have predominantly focused on the word space, which leaves them ineffective against feature-space triggers and significantly impairs performance on clean data. To address this, we introduce a universal backdoor defense that purifies backdoor samples in the activation space by drawing abnormal activations toward optimized minimum clean activation distribution intervals. The advantages of our approach are twofold: (1) by operating in the activation space, our method captures information ranging from surface-level features such as words to higher-level semantic concepts such as syntax, and can therefore counteract diverse triggers; (2) the fine-grained, continuous nature of the activation space allows more precise preservation of clean content while removing triggers. Furthermore, we propose a detection module based on statistical information of abnormal activations, achieving a better trade-off between clean accuracy and defense performance.
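The abstract does not specify how the "minimum clean activation distribution intervals" are optimized, so the sketch below only illustrates the general idea under simple assumptions: per-neuron intervals estimated from a small held-out clean set with a three-sigma-style rule, purification by clipping activations into those intervals, and a detection statistic based on how far a sample's activations fall outside them. The function names (`fit_clean_intervals`, `purify`, `abnormality_score`) and the interval width `k` are illustrative choices, not the authors' implementation.

```python
import numpy as np

def fit_clean_intervals(clean_acts: np.ndarray, k: float = 3.0):
    """Estimate per-neuron clean activation intervals from held-out clean
    activations (three-sigma-style rule; k is a hypothetical width
    hyperparameter, not a value taken from the paper)."""
    mu = clean_acts.mean(axis=0)
    sigma = clean_acts.std(axis=0) + 1e-8
    return mu - k * sigma, mu + k * sigma  # (lower, upper), each of shape (num_neurons,)

def purify(acts: np.ndarray, lower: np.ndarray, upper: np.ndarray) -> np.ndarray:
    """Draw abnormal activations back into the clean intervals by clipping;
    activations already inside their interval are left unchanged."""
    return np.clip(acts, lower, upper)

def abnormality_score(acts: np.ndarray, lower: np.ndarray, upper: np.ndarray) -> np.ndarray:
    """Detection-style statistic: total out-of-interval mass per sample.
    Larger scores indicate activations that deviate more from the clean range."""
    overflow = np.maximum(acts - upper, 0.0) + np.maximum(lower - acts, 0.0)
    return overflow.sum(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.normal(0.0, 1.0, size=(500, 768))     # stand-in for clean hidden activations
    suspicious = rng.normal(0.0, 1.0, size=(4, 768))
    suspicious[:, :10] += 8.0                          # a few neurons pushed far outside the clean range
    lo, hi = fit_clean_intervals(clean)
    print("scores:", np.round(abnormality_score(suspicious, lo, hi), 2))
    print("max activation after purification:", round(float(purify(suspicious, lo, hi).max()), 2))
```

In this toy setup, samples whose score exceeds a threshold would be routed through purification (or flagged), while low-scoring samples pass through untouched, which is one way the detection module could trade off clean accuracy against defense performance.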