- The paper introduces Sleeper Agent, a novel method that embeds hidden trigger backdoors in neural networks trained from scratch.
- It leverages gradient matching, strategic data selection, and adaptive retraining to achieve an attack success rate above 85% on CIFAR-10 with a poison budget of only 1%.
- Empirical results on ImageNet and diverse architectures demonstrate its scalability and stealth, raising significant security concerns for automated data curation in deep learning.
Overview of "Sleeper Agent: Scalable Hidden Trigger Backdoors for Neural Networks Trained from Scratch"
The paper "Sleeper Agent: Scalable Hidden Trigger Backdoors for Neural Networks Trained from Scratch" addresses the security vulnerabilities in modern deep learning systems arising from data poisoning attacks. Specifically, it presents a novel approach to hidden trigger backdoor attacks, a significant threat in machine learning environments where data curation is largely automated. These attacks involve manipulating training data to embed undetectable 'triggers' that alter model behavior during inference when specific conditions are met.
Key Contributions
The central contribution is the Sleeper Agent attack, which combines gradient matching, strategic data selection, and periodic re-training of surrogate models to plant hidden backdoors in neural networks. Crucially, the attack remains effective when the victim model is trained from scratch, a setting in which prior hidden-trigger methods such as the Hidden Trigger Backdoor attack (HTBD) fail. The paper provides empirical evidence of Sleeper Agent's success across a range of settings, including the large-scale ImageNet dataset and black-box conditions in which the attacker has little or no knowledge of the victim's architecture.
Technical Approach
- Gradient Matching: At the core of the attack is a gradient alignment objective that replaces the bi-level optimization problem induced by training on poisoned data. Instead of differentiating through the training process of a deep network, the attacker perturbs the poison images so that the training gradient they induce aligns with the gradient of the adversarial objective, namely patched source-class images labeled with the target class; see the gradient-matching sketch after this list.
- Data Selection: Rather than perturbing randomly chosen images, the method selects poison samples with high training impact, for instance those whose gradients under the surrogate model have the largest norms. Concentrating the poison budget on these influential samples markedly raises the attack success rate (see the crafting-loop sketch after this list).
- Adaptive Retraining: During crafting, Sleeper Agent periodically retrains its surrogate models on the poisoned data. This keeps the surrogate parameters, and hence the gradients used for alignment, representative of a victim network that is actually being trained on the corrupted dataset (also illustrated in the crafting-loop sketch after this list).
- Scalability and Transferability: The Sleeper Agent scales efficiently to large datasets and demonstrates considerable transferability across different neural architectures, reinforcing its practicality in real-world scenarios where the victim's model specifics are unknown.
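The following PyTorch sketch illustrates the gradient-matching objective described above. It is an illustrative reconstruction, not the authors' reference implementation: the function name `alignment_loss`, the corner placement of the patch, and the l-infinity budget `eps` are assumptions, and practical details such as data augmentation and batching are omitted.

```python
import torch
import torch.nn.functional as F

def alignment_loss(model, poison_imgs, poison_deltas, poison_labels,
                   source_imgs, patch, target_class, eps=16 / 255):
    """Gradient-matching objective (illustrative sketch).

    The adversarial gradient comes from source-class images stamped with the
    trigger patch and labeled as the attacker's target class; the poison
    gradient comes from the perturbed poison images with their clean labels.
    The attacker minimizes one minus the cosine similarity between the two.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    # Adversarial gradient: the update the attacker wants training to produce.
    patched = source_imgs.clone()
    _, ph, pw = patch.shape
    patched[:, :, :ph, :pw] = patch                      # simplified placement
    adv_targets = torch.full((patched.size(0),), target_class,
                             dtype=torch.long, device=patched.device)
    adv_loss = F.cross_entropy(model(patched), adv_targets)
    adv_grad = torch.autograd.grad(adv_loss, params)

    # Poison gradient: the update the clean-label poisons actually produce.
    deltas = poison_deltas.clamp(-eps, eps)              # l_inf budget (assumed)
    poisoned = (poison_imgs + deltas).clamp(0, 1)
    poison_loss = F.cross_entropy(model(poisoned), poison_labels)
    poison_grad = torch.autograd.grad(poison_loss, params, create_graph=True)

    # Negative cosine similarity between the flattened gradient vectors.
    dot = sum((a * p).sum() for a, p in zip(adv_grad, poison_grad))
    a_norm = torch.sqrt(sum((a * a).sum() for a in adv_grad))
    p_norm = torch.sqrt(sum((p * p).sum() for p in poison_grad))
    return 1 - dot / (a_norm * p_norm + 1e-12)
```

Minimizing this loss with respect to the perturbations `poison_deltas`, subject to the l-infinity constraint, yields clean-label poison images that steer training toward the backdoor without changing their labels.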
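The data-selection and retraining steps can likewise be combined into a high-level crafting loop, sketched below under the same assumptions. It reuses `alignment_loss` from the previous sketch; the `retrain` helper, the gradient-norm selection rule applied to target-class images, and the hyperparameters (`steps`, `retrain_every`, learning rate) are illustrative choices rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def craft_poisons(surrogate, train_images, train_labels, source_imgs, patch,
                  target_class, budget, steps=250, retrain_every=50,
                  eps=16 / 255):
    """High-level crafting loop (illustrative sketch, not the reference code)."""
    params = [p for p in surrogate.parameters() if p.requires_grad]

    # Data selection: keep the `budget` candidate images whose gradients under
    # the surrogate have the largest norms (here drawn from the target class).
    candidates = (train_labels == target_class).nonzero(as_tuple=True)[0]
    norms = []
    for i in candidates:
        loss = F.cross_entropy(surrogate(train_images[i:i + 1]),
                               train_labels[i:i + 1])
        grads = torch.autograd.grad(loss, params)
        norms.append(torch.sqrt(sum((g * g).sum() for g in grads)))
    poison_idx = candidates[torch.stack(norms).topk(budget).indices]

    poison_imgs = train_images[poison_idx].clone()
    poison_labels = train_labels[poison_idx]
    deltas = torch.zeros_like(poison_imgs, requires_grad=True)
    opt = torch.optim.Adam([deltas], lr=0.01)

    for step in range(steps):
        opt.zero_grad()
        loss = alignment_loss(surrogate, poison_imgs, deltas, poison_labels,
                              source_imgs, patch, target_class, eps)
        loss.backward()
        opt.step()
        with torch.no_grad():
            deltas.clamp_(-eps, eps)

        # Adaptive retraining: periodically refresh the surrogate on the
        # current poisoned training set so that the gradients used for
        # alignment track a network realistically trained on corrupted data.
        # `retrain` is a hypothetical helper standing in for ordinary training.
        if (step + 1) % retrain_every == 0:
            retrain(surrogate, train_images, train_labels, poison_idx,
                    (poison_imgs + deltas).detach().clamp(0, 1))

    return poison_idx, deltas.detach().clamp(-eps, eps)
```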
Results
Empirically, Sleeper Agent achieves attack success rates above 85% on CIFAR-10 with a poison budget of 1%, while the poisoned models retain validation accuracy close to that of unpoisoned models. Results on ImageNet further confirm its practicality, producing substantial rates of triggered misclassification under a much tighter poison budget (0.05%). The paper also reports that the attack remains effective even when the victim employs adversarial training, underscoring its robustness and stealth.
Implications and Future Directions
The research has important implications for the security of machine learning systems. As large-scale automated data collection becomes ubiquitous, the potential for data poisoning attacks like the Sleeper Agent poses a genuine risk. This necessitates the development of more sophisticated defenses and encourages the integration of adversarial robustness as a core component of model training pipelines.
Future work could explore refining defense mechanisms against such hidden trigger attacks and extending the analysis to more complex settings, such as multi-modal datasets. Investigating the interplay between adversarial training and poisoning effectiveness may also yield insights into building machine learning models that are harder to compromise.
In conclusion, the paper contributes significantly to the understanding of backdoor vulnerabilities, providing a potent tool for attackers while simultaneously challenging the community to counteract such threats in future neural network designs.