Taxonomy of Real Faults in Deep Learning Systems (1910.11015v3)

Published 24 Oct 2019 in cs.SE, cs.AI, and cs.LG

Abstract: The growing application of deep neural networks in safety-critical domains makes the analysis of faults that occur in such systems of enormous importance. In this paper we introduce a large taxonomy of faults in deep learning (DL) systems. We have manually analysed 1059 artefacts gathered from GitHub commits and issues of projects that use the most popular DL frameworks (TensorFlow, Keras and PyTorch) and from related Stack Overflow posts. Structured interviews with 20 researchers and practitioners describing the problems they have encountered in their experience have enriched our taxonomy with a variety of additional faults that did not emerge from the other two sources. Our final taxonomy was validated with a survey involving an additional set of 21 developers, confirming that almost all fault categories (13/15) were experienced by at least 50% of the survey participants.

Citations (247)

Summary

  • The paper presents a taxonomy categorizing DL faults identified from 1,059 artifacts and structured developer interviews.
  • It employs a bottom-up methodology to group fault causes into five key categories: models, tensor inputs, training, GPU usage, and APIs.
  • The validated taxonomy offers actionable guidelines for developers to enhance testing strategies and improve DL system robustness.

Taxonomy of Real Faults in Deep Learning Systems

The paper "Taxonomy of Real Faults in Deep Learning Systems" presents an empirical paper to identify and classify faults that occur in deep learning (DL) systems. This research is centered on the need to understand and mitigate the impact of faults as DL systems are increasingly deployed in critical environments. The authors contribute to the field by developing a taxonomy of faults identified through a rigorous analysis of 1,059 artifacts from GitHub and Stack Overflow, complemented with structured interviews from 20 researchers and practitioners. The taxonomy is further validated through a survey of an additional 21 developers.

Methodology

The authors employed a comprehensive approach to construct the taxonomy, which involved:

  1. Artifact Analysis: Manually examining artifacts from GitHub commits/issues and Stack Overflow discussions related to TensorFlow, Keras, and PyTorch. This entailed identifying the root causes of faults described in these artifacts.
  2. Interviews: Conducting structured interviews with 20 developers across research and industry to gather qualitative insights about the faults they encounter in DL system development. This helped capture faults that might not be evident in public code or discussion platforms.
  3. Validation Survey: A survey of 21 additional DL developers validated the taxonomy, confirming that 13 of the 15 fault categories were experienced by at least 50% of the participants.

The researchers constructed the taxonomy bottom-up, grouping similar root causes into cohesive categories and subcategories. This approach keeps the taxonomy expansive and adaptable to newly emerging fault types in DL systems.

Taxonomy Overview

The taxonomy is segmented into five top-level categories reflecting the main areas where issues typically arise in DL systems (a code sketch after the list illustrates two representative faults):

  1. Model: Faults related to the model structure, including incorrect initialization, inappropriate model types, or suboptimal layer configurations.
  2. Tensors and Inputs: Issues concerning the shape, type, and format of input data, such as incorrect tensor shapes affecting operations.
  3. Training: A comprehensive category addressing hyperparameter tuning, loss and optimization functions, data preprocessing, and data quality issues.
  4. GPU Usage: Problems specifically related to the use of GPU resources, such as incorrect data transfer or state management.
  5. API Issues: Misunderstandings or incorrect usage of API functions within DL frameworks.
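
To make these categories concrete, the following minimal PyTorch sketch (our own illustration, not code from the paper's artifacts) reproduces two representative faults: a tensor-shape mismatch from the Tensors and Inputs category and a device-placement error from the GPU Usage category.

```python
import torch
import torch.nn as nn

# Tensors and Inputs fault: the data's shape disagrees with the model's
# expected input. nn.Linear(10, 2) expects a trailing dimension of 10,
# but the batch below has 8 features.
model = nn.Linear(10, 2)
bad_batch = torch.randn(4, 8)
try:
    model(bad_batch)
except RuntimeError as e:
    print(f"tensor/input fault: {e}")

# Fix: make the input shape agree with the layer definition.
good_batch = torch.randn(4, 10)
out = model(good_batch)  # shape (4, 2)

# GPU Usage fault: model and data live on different devices.
if torch.cuda.is_available():
    gpu_model = nn.Linear(10, 2).to("cuda")
    cpu_batch = torch.randn(4, 10)  # never moved to the GPU
    try:
        gpu_model(cpu_batch)
    except RuntimeError as e:
        print(f"GPU-usage fault: {e}")
    gpu_model(cpu_batch.to("cuda"))  # fix: transfer the data first
```

Both of these faults fail loudly at run time, which is arguably why they are well represented in GitHub issues and Stack Overflow posts; faults in the Model and Training categories can be more insidious, since training may proceed while silently converging poorly.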

Implications and Future Work

The taxonomy serves as both a guide for developers to preemptively address common DL faults and a checklist for testers to develop more comprehensive test scenarios. These insights can foster more robust DL systems by enhancing the fault detection and mitigation strategies employed during development and testing phases.
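
As one example of turning the taxonomy into a concrete test scenario, the sketch below (our construction; the helper name and its parameters are hypothetical) converts a common Training-category concern into a cheap pre-flight check that the loss actually decreases over a few optimizer steps:

```python
import torch
import torch.nn as nn

def check_loss_decreases(model, loss_fn, optimizer, batch, targets, steps=20):
    """Return True if training loss drops on a tiny fixed batch.

    A cheap probe for several Training-category faults, e.g. a
    mismatched loss function or a badly chosen learning rate.
    """
    initial = loss_fn(model(batch), targets).item()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(batch), targets)
        loss.backward()
        optimizer.step()
    return loss_fn(model(batch), targets).item() < initial

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
batch, targets = torch.randn(32, 10), torch.randint(0, 2, (32,))
ok = check_loss_decreases(model, nn.CrossEntropyLoss(),
                          torch.optim.SGD(model.parameters(), lr=0.1),
                          batch, targets)
print("training sanity check passed:", ok)
```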

Moreover, the taxonomy lays the groundwork for further research on automatically identifying and categorizing DL faults, potentially leading to improved tooling and methodologies in DL system development. This work could also inspire the creation of novel mutation testing methods specific to DL, ensuring that testing scenarios remain relevant to real-world fault conditions.
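
For instance, a DL-specific mutation operator might perturb the model's structure rather than its source code. The sketch below (our illustration, not a method proposed in the paper) disables every ReLU in a small model; a test suite would "kill" the mutant if the mutated model's behavior diverges detectably from the original:

```python
import copy
import torch
import torch.nn as nn

def mutate_remove_relu(model):
    """Return a deep copy of an nn.Sequential with all ReLUs disabled."""
    mutant = copy.deepcopy(model)
    for i, layer in enumerate(mutant):
        if isinstance(layer, nn.ReLU):
            mutant[i] = nn.Identity()
    return mutant

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
mutant = mutate_remove_relu(model)
x = torch.randn(4, 10)
# A test suite "kills" the mutant when its outputs diverge from the
# original's on test data; typically False here, since removing the
# ReLU changes the function the network computes.
print(torch.allclose(model(x), mutant(x)))
```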

By noting its limitations and its distinctions from previous research, the paper highlights the uniqueness and breadth of the taxonomy. The contribution adds a valuable layer of understanding to the testing and validation of AI systems and to the ongoing discourse on DL reliability and robustness. Future work may extend these findings to other emerging frameworks and develop automated tools that operationalize these insights.