Detecting AI Trojans Using Meta Neural Analysis (1910.03137v4)

Published 8 Oct 2019 in cs.AI, cs.CR, and cs.LG

Abstract: In machine learning Trojan attacks, an adversary trains a corrupted model that obtains good performance on normal data but behaves maliciously on data samples with certain trigger patterns. Several approaches have been proposed to detect such attacks, but they make undesirable assumptions about the attack strategies or require direct access to the trained models, which restricts their utility in practice. This paper addresses these challenges by introducing a Meta Neural Trojan Detection (MNTD) pipeline that does not make assumptions on the attack strategies and only needs black-box access to models. The strategy is to train a meta-classifier that predicts whether a given target model is Trojaned. To train the meta-model without knowledge of the attack strategy, we introduce a technique called jumbo learning that samples a set of Trojaned models following a general distribution. We then dynamically optimize a query set together with the meta-classifier to distinguish between Trojaned and benign models. We evaluate MNTD with experiments on vision, speech, tabular data and natural language text datasets, and against different Trojan attacks such as data poisoning attack, model manipulation attack, and latent attack. We show that MNTD achieves 97% detection AUC score and significantly outperforms existing detection approaches. In addition, MNTD generalizes well and achieves high detection performance against unforeseen attacks. We also propose a robust MNTD pipeline which achieves 90% detection AUC even when the attacker aims to evade the detection with full knowledge of the system.

Authors (6)

Xiaojun Xu
Qi Wang
Huichen Li
Nikita Borisov
Carl A. Gunter
Bo Li

Citations (292)

Summary

The paper introduces a meta-classifier that detects AI Trojans using only black-box access, bypassing reliance on model internals.
It employs jumbo learning to generate diverse Trojan shadow models, broadening detection across varied attack scenarios.
Evaluations across vision, speech, tabular, and text datasets show a 97% detection AUC, and a robust variant of MNTD retains 90% detection AUC against adaptive attackers who know the detection system.

Essay on "Detecting AI Trojans Using Meta Neural Analysis"

The paper presents Meta Neural Trojan Detection (MNTD), a method for detecting Trojan attacks on AI models, specifically deep neural networks (DNNs). A Trojaned model behaves normally on clean inputs but acts maliciously when an input contains a specific "trigger" pattern, which makes such attacks a serious threat to ML systems. The authors argue that existing detection strategies are limited in practice because they either make assumptions about the attack strategy or require direct access to the trained model.
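To make the threat concrete, the following is a minimal sketch of one attack family the paper evaluates, data poisoning. The function name `poison`, the square corner patch used as the trigger, the poisoning rate, and the toy array shapes are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def poison(images, labels, target_label, rate=0.1, patch=3, seed=0):
    """Stamp a trigger on a random fraction of samples and relabel them.

    Hypothetical helper: the trigger is a bright square in the bottom-right
    corner, which is an assumed pattern chosen only for illustration.
    """
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()    # leave the clean data intact
    n_poison = int(rate * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images[idx, -patch:, -patch:] = 1.0              # stamp the trigger patch
    labels[idx] = target_label                       # attacker-chosen label
    return images, labels, idx

# Toy data: 100 blank 8x8 "images" with labels 0..9.
clean_x = np.zeros((100, 8, 8), dtype=np.float32)
clean_y = np.arange(100) % 10
x, y, idx = poison(clean_x, clean_y, target_label=7)
```

A model trained on `x, y` would see mostly clean data, so its accuracy on normal inputs can stay high while it learns to associate the patch with the target label.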

Key Contributions

Meta-classifier Development: MNTD introduces a meta-classifier, a higher-order model trained to identify whether a DNN is Trojaned or benign, using only black-box access. This approach contrasts sharply with existing methods that often require white-box access or assume specific attack traits.
Jumbo Learning Approach: A core innovation in MNTD is jumbo learning, where Trojaned shadow models are generated following a broad distribution of potential attack vectors. The shadow models, varying in attack settings, are employed to train the meta-classifier. Unlike typical practices that focus on pre-established trigger patterns or standard poisoning attacks, jumbo learning samples a wide array of hypothetical Trojan attacks, effectively broadening the detection horizon.
Black-box Model Testing: MNTD characterizes a model purely by its outputs on a small set of queries. Rather than fixing these queries in advance, the query set is optimized jointly with the meta-classifier, so the resulting input-output representation is tuned to separate Trojaned models from benign ones.
Comprehensive Evaluation: Experiments span vision, speech, tabular, and natural language text datasets and cover data poisoning, model manipulation, and latent attacks. MNTD achieves a 97% detection AUC, significantly outperforms existing detection approaches, and generalizes well to unforeseen attacks. A robust variant of the pipeline still achieves 90% detection AUC when the attacker has full knowledge of the system and actively tries to evade detection.
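The interplay between jumbo learning, query tuning, and the meta-classifier can be sketched in a toy setting. Everything concrete below is an assumption made for illustration: the shadow models are linear maps rather than DNNs, the meta-classifier is a logistic regression with hand-written gradients, and the sampled "backdoor" is a random boost to a few input weights of a random target class. The names `sample_shadow`, `forward`, and `trojan_score` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, m = 16, 3, 4                  # input dim, classes, number of queries (toy sizes)
n_benign, n_trojan = 20, 20
base = 0.5 * rng.standard_normal((k, d))    # shared "clean task" weights

def sample_shadow(trojaned):
    """Toy linear shadow model; a Trojaned one gets a randomly sampled backdoor."""
    W = base + 0.1 * rng.standard_normal((k, d))
    if trojaned:
        # Jumbo-style sampling: random trigger location, strength, and target class.
        mask = rng.choice(d, size=4, replace=False)
        target = rng.integers(k)
        W[target, mask] += np.abs(rng.standard_normal(4)) + 0.5
    return W

shadows = np.stack([sample_shadow(i >= n_benign) for i in range(n_benign + n_trojan)])
labels = np.array([0.0] * n_benign + [1.0] * n_trojan)

Q = 0.5 * rng.standard_normal((m, d))   # trainable query set
Wm = np.zeros((m, k))                   # meta-classifier weights, one per (query, output)
b = 0.0

def forward(Q, Wm, b, models):
    outs = np.einsum("md,nkd->nmk", Q, models)   # each model's output on each query
    z = np.einsum("nmk,mk->n", outs, Wm) + b
    return outs, 1.0 / (1.0 + np.exp(-z))

def bce(p, y, eps=1e-12):
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

_, p = forward(Q, Wm, b, shadows)
initial_loss = bce(p, labels)

lr = 0.05
for _ in range(300):
    outs, p = forward(Q, Wm, b, shadows)
    g = (p - labels) / len(labels)                        # dLoss/dz per shadow model
    grad_Wm = np.einsum("n,nmk->mk", g, outs)
    grad_Q = np.einsum("n,mk,nkd->md", g, Wm, shadows)    # gradient reaches the queries
    grad_b = g.sum()
    Wm -= lr * grad_Wm                                    # update the meta-classifier
    Q -= lr * grad_Q                                      # update the queries jointly
    b -= lr * grad_b

_, p = forward(Q, Wm, b, shadows)
final_loss = bce(p, labels)

def trojan_score(query_fn):
    """Score a target model using only its outputs on the tuned queries."""
    outs = np.stack([query_fn(q) for q in Q])             # shape (m, k)
    z = float(np.sum(outs * Wm) + b)
    return 1.0 / (1.0 + np.exp(-z))
```

In this sketch, gradients pass only through the defender's own shadow models during training. At detection time `trojan_score` needs nothing but a callable that returns the target model's outputs, for example `trojan_score(lambda q: shadows[-1] @ q)`, which mirrors the black-box setting described above.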

Implications and Future Work

The implications of this work extend to AI security, particularly in safety-critical applications. Detecting Trojans with minimal assumptions and limited access to models enhances applicability in real-world scenarios, where AI models are often consumed as black-box services.

Future work could explore reducing reliance on clean datasets, improving query efficiency, and extending MNTD to model types beyond neural networks. MNTD's ability to generalize to unforeseen attacks suggests the approach could be adapted to emerging threats as adversaries craft increasingly sophisticated attacks.

In conclusion, the authors propose a framework that advances the detection of AI Trojans: a meta-classifier trained on a broad distribution of sampled attacks detects Trojaned models without the attack-specific assumptions or white-box access that constrain existing methods. MNTD is a meaningful step toward making AI systems more resilient to malicious interference.
