DeepXplore: Automated Whitebox Testing of Deep Learning Systems (1705.06640v4)

Published 18 May 2017 in cs.LG, cs.CR, and cs.SE

Abstract: Deep learning (DL) systems are increasingly deployed in safety- and security-critical domains including self-driving cars and malware detection, where the correctness and predictability of a system's behavior for corner case inputs are of great importance. Existing DL testing depends heavily on manually labeled data and therefore often fails to expose erroneous behaviors for rare inputs. We design, implement, and evaluate DeepXplore, the first whitebox framework for systematically testing real-world DL systems. First, we introduce neuron coverage for systematically measuring the parts of a DL system exercised by test inputs. Next, we leverage multiple DL systems with similar functionality as cross-referencing oracles to avoid manual checking. Finally, we demonstrate how finding inputs for DL systems that both trigger many differential behaviors and achieve high neuron coverage can be represented as a joint optimization problem and solved efficiently using gradient-based search techniques. DeepXplore efficiently finds thousands of incorrect corner case behaviors (e.g., self-driving cars crashing into guard rails and malware masquerading as benign software) in state-of-the-art DL models with thousands of neurons trained on five popular datasets including ImageNet and Udacity self-driving challenge data. For all tested DL models, on average, DeepXplore generated one test input demonstrating incorrect behavior within one second while running only on a commodity laptop. We further show that the test inputs generated by DeepXplore can also be used to retrain the corresponding DL model to improve the model's accuracy by up to 3%.

Authors (4)
  1. Kexin Pei (20 papers)
  2. Yinzhi Cao (26 papers)
  3. Junfeng Yang (80 papers)
  4. Suman Jana (50 papers)
Citations (1,288)

Summary

  • The paper introduces a novel testing framework that uses neuron coverage and differential testing to detect erroneous corner-case behaviors in deep learning models.
  • It employs a gradient-based search algorithm to maximize neuron coverage, outperforming traditional adversarial and random testing methods.
  • DeepXplore streamlines error detection in safety-critical systems by reducing the need for manual labeling and enhancing models through robust retraining.

Insightful Overview of "DeepXplore: Automated Whitebox Testing of Deep Learning Systems"

The academic paper titled "DeepXplore: Automated Whitebox Testing of Deep Learning Systems" presents a significant advancement in the systematic testing of Deep Learning (DL) systems. Authored by Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana, this work introduces DeepXplore, an automated framework designed to uncover incorrect behaviors in DL models without requiring manual labeling.

Summary of Contributions and Results

DeepXplore addresses critical gaps in the existing DL testing process, focusing on safety- and security-critical systems like self-driving vehicles and malware detectors. Conventional DL testing relies heavily on manually labeled data and therefore often fails to expose erroneous behaviors for rare or corner-case inputs. This paper offers a structured, systematic alternative through whitebox testing, which leverages the internal structure of DL models to improve testing efficacy.

The key contributions of DeepXplore revolve around introducing 'neuron coverage' as a novel metric for measuring the thoroughness of DL testing and employing differential testing across multiple DL models with similar functionalities to identify potential errors automatically. Neuron coverage effectively gauges which parts of a DL model’s logic are exercised by test inputs, akin to code coverage in traditional software testing.
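To make the analogy to code coverage concrete, neuron coverage can be sketched as the fraction of neurons whose activation exceeds a threshold for at least one test input. The function below is a minimal illustrative sketch, not the paper's implementation; in particular, DeepXplore scales activations within each layer before thresholding, a step omitted here, and all names are assumptions.

```python
import numpy as np

def neuron_coverage(activations, threshold=0.0):
    """Fraction of neurons pushed above `threshold` by at least one input.

    `activations` is a list of 2-D arrays, one per layer, each shaped
    (num_inputs, num_neurons). Illustrative sketch only: DeepXplore also
    scales activations per layer before comparing to the threshold.
    """
    covered = 0
    total = 0
    for layer in activations:
        # A neuron counts as covered if any test input activates it
        # beyond the threshold.
        fired = (layer > threshold).any(axis=0)
        covered += int(fired.sum())
        total += layer.shape[1]
    return covered / total

# Two layers, three inputs: layer 1 has 4 neurons, layer 2 has 2.
acts = [
    np.array([[0.9, -0.2, 0.0, 0.3],
              [0.1, -0.5, 0.0, 0.0],
              [0.0, -0.1, 0.0, 0.0]]),
    np.array([[0.0, 0.7],
              [0.0, 0.2],
              [0.0, 0.0]]),
]
print(neuron_coverage(acts, threshold=0.0))  # 3 of 6 neurons fire → 0.5
```

Raising the threshold shrinks the covered set, which is how the metric distinguishes inputs that merely touch a layer from inputs that strongly exercise it.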

DeepXplore implements a gradient-based search algorithm that solves this joint optimization problem: simultaneously maximizing neuron coverage and the number of differential behaviors among the tested models. The framework demonstrated its capability by efficiently identifying thousands of incorrect corner-case behaviors across several state-of-the-art DL models. For instance, during evaluations on models trained with ImageNet and Udacity self-driving challenge data, DeepXplore discovered erroneous behaviors (e.g., self-driving cars making incorrect navigational decisions) and achieved higher neuron coverage than existing adversarial and random testing methods.
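The joint objective can be sketched in miniature: keep one model confident in the agreed class, push a second model away from it, and raise the activation of an under-covered neuron, then ascend the gradient of that objective in input space. The toy below uses two hypothetical linear models so the input gradient is exact; a real DNN would obtain it by backpropagating to the input. All weights, hyperparameter values, and names are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical linear "models" with similar functionality, standing in
# for real DNNs: logits(x) = W @ x.
W1 = rng.normal(size=(3, 4))
W2 = W1 + 0.05 * rng.normal(size=(3, 4))   # a near-identical second model

x = rng.normal(size=4)                     # seed input
c = int(np.argmax(W1 @ x))                 # class the models initially agree on
lam1, lam2 = 1.0, 0.5                      # weights balancing the two goals
uncovered = (c + 1) % 3                    # a neuron we also want to activate

def objective(x):
    # Keep model 1 confident in class c, push model 2 away from c,
    # and raise the activation of the chosen under-covered neuron.
    return (W1 @ x)[c] - lam1 * (W2 @ x)[c] + lam2 * (W1 @ x)[uncovered]

# For linear models the input gradient is exact and constant; real DNNs
# would recompute it by backpropagation at every step.
grad = W1[c] - lam1 * W2[c] + lam2 * W1[uncovered]

start = objective(x)
for _ in range(100):
    x = x + 0.05 * grad                    # gradient ascent in input space
print(objective(x) > start)                # the joint objective increases
```

In the full framework, domain-specific constraints (e.g., keeping image pixels valid, or only darkening them to simulate lighting changes) are applied to each gradient step so the generated inputs stay realistic.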

Theoretical and Practical Implications

Theoretically, the paper extends the frontiers of testing methodologies for DL systems by illustrating that neuron coverage is a more meaningful measure of testing comprehensiveness for DL systems than traditional code coverage. The experimental results validate that even minimal increases in neuron coverage can significantly diversify the types of erroneous behaviors detected. This insight is critical because it shows that exercising more neural pathways within a model uncovers latent bugs that could be consequential in practical deployments.

Practically, the implications of DeepXplore are multifaceted:

  1. Automated Error Detection: By eliminating the need for manual labeling, DeepXplore can substantially reduce the resources and costs associated with DL model validation in industrial settings.
  2. Data Augmentation for Robust Training: The erroneous inputs identified can be incorporated into retraining pipelines to fortify models against similar future failures, thereby enhancing the overall robustness of DL systems.
  3. Polluted Data Detection: Beyond identifying erroneous behaviors, DeepXplore shows promise in detecting polluted datasets, which can be a potent mechanism against data poisoning attacks in adversarial settings.
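For the retraining use case (point 2 above), the same differential oracle that flags an input can also label it: the majority prediction across the cross-referencing models serves as the training label, avoiding manual annotation. The helper below is a hypothetical sketch of that augmentation step; the function name and array shapes are assumptions for illustration.

```python
import numpy as np

def augment_with_diff_inputs(X_train, y_train, diff_inputs, model_preds):
    """Append difference-inducing inputs to the training set, labeling each
    by majority vote across the cross-referencing models.

    Hypothetical helper: `model_preds` has shape (num_models, num_inputs),
    one predicted class per model per difference-inducing input.
    """
    votes = np.asarray(model_preds)
    # Majority vote per input column decides its training label.
    labels = np.array([np.bincount(col).argmax() for col in votes.T])
    X_aug = np.vstack([X_train, diff_inputs])
    y_aug = np.concatenate([y_train, labels])
    return X_aug, y_aug

# Example: three oracle models vote on two difference-inducing inputs.
X_aug, y_aug = augment_with_diff_inputs(
    np.zeros((2, 3)), np.array([0, 1]),
    np.ones((2, 3)), [[0, 1], [0, 1], [1, 1]],
)
print(y_aug)  # majority-vote labels appended after the original labels
```

Retraining on the augmented set is what yields the up-to-3% accuracy improvement reported in the paper.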

Future Prospects for Research

The presented work opens several avenues for future research. First, deeper exploration of neuron coverage could yield more refined metrics that predict DL model robustness with greater accuracy. Additionally, integrating DeepXplore with formal verification methods could offer a more comprehensive framework that not only detects erroneous behaviors but also guarantees their absence within prescribed bounds.

Furthermore, it would be beneficial to address a limitation of differential testing by extending the framework to handle scenarios where highly similar models do not yield sufficiently divergent behavior. Investigating the thresholds at which DeepXplore's efficacy diminishes could also help refine model architectures and training methodologies for better alignment with automated testing frameworks.

In conclusion, "DeepXplore: Automated Whitebox Testing of Deep Learning Systems" represents a consequential step towards more reliable and systematic testing of DL models. The approaches delineated in the paper emphasize methodological rigor and offer practical tools and insights that can significantly enhance the safety and robustness of DL systems in critical applications. This work not only elevates the standards for DL testing but also sets a foundation for future innovations in the field of automated DL validation.