Bug Taxonomy: Classifying Software Defects
- Bug taxonomy is a structured classification scheme that organizes software defects by attributes such as origin, manifestation, severity, and domain specificity.
- It supports empirical analysis and bug triage automation by leveraging data from version control, issue trackers, and developer feedback to streamline root-cause identification.
- Domain-specific refinements, like those in quantum software and Jupyter Notebooks, guide targeted testing strategies and improve rapid defect resolution in diverse ecosystems.
A bug taxonomy is a structured classification scheme for categorizing software defects based on their origin, manifestation, type, severity, impact, or domain specificity. Rigorous bug taxonomies are central to empirical software engineering, bug triage automation, root-cause analysis, and the creation of diagnostic and quality assurance tools. Modern research develops multidimensional taxonomies grounded in empirical data from version control, issue trackers, patch notes, and domain-specific developer feedback, recognizing that effective taxonomies must adapt to their context: generic software, domain applications (video games, quantum software, computational notebooks), or ecosystem-specific workflows (e.g., NPM packages).
1. Multidimensional Taxonomy Structures
Contemporary bug taxonomies are increasingly multidimensional, combining orthogonal categorization axes to capture defect diversity with granularity and precision.
- Origin-Based Taxonomies: The InEx-Bug taxonomy for the NPM ecosystem formalizes a distinction between Intrinsic (defect in local codebase), Extrinsic (breakage induced by changed dependencies or environmental drift), Not-a-Bug, and Unknown (Wright et al., 13 Feb 2026).
- Domain-Specific Taxonomies: In quantum computing, a five-axis taxonomy distinguishes bug type (quantum/classical/uncategorized), coarse category (16+ classes), severity (Critical/High/Medium/Low), impacted quality attributes (e.g., Usability, Reliability per ISO/IEC 25010), and quantum-specific subtypes (circuit issues, gate errors, etc.) (Yousuf et al., 12 Jun 2025).
- Root Cause/Manifestation Frameworks: Catolino et al. provide a root-cause taxonomy for general-purpose OSS with nine categories: configuration, network, database, GUI, performance, permission/deprecation, security, program anomaly, and test code (Catolino et al., 2019).
- Application/Phenomenology-Oriented: Jupyter Notebooks exhibit bugs in eight high-level categories (kernel, conversion, portability, environment/settings, connection, processing, cell defect, and implementation), each with subtypes tied to interactive notebook workflows (Santana et al., 2022).
- Behavior/Subsystem Taxonomies (Game Software): Game update notes yield a 20-category taxonomy, spanning action, AI, audio, camera, collisions, crash, exploit, UI, value, and more, enabling recurrence and severity analysis in shipped game patches (Truelove et al., 2021).
This multidimensional approach facilitates automated classification and supports broad-spectrum defect analytics across heterogeneous codebases and ecosystems.
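A multi-axis label can be represented as a small record with one field per axis. The sketch below is illustrative only: it combines the origin axis described for InEx-Bug with a severity axis like the one described for quantum software, and the class and field names (`BugLabel`, `axes`, and so on) are assumptions rather than part of any cited taxonomy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Origin(Enum):
    INTRINSIC = "intrinsic"    # defect in the local codebase
    EXTRINSIC = "extrinsic"    # breakage from changed dependencies or environment drift
    NOT_A_BUG = "not-a-bug"
    UNKNOWN = "unknown"


class Severity(Enum):
    CRITICAL = 4
    HIGH = 3
    MEDIUM = 2
    LOW = 1


@dataclass(frozen=True)
class BugLabel:
    """Hypothetical record: one independent label per taxonomy axis."""
    origin: Origin
    category: str                            # e.g. "configuration", "kernel", "crash"
    severity: Optional[Severity] = None      # may be left unassigned
    quality_attribute: Optional[str] = None  # e.g. "Reliability"

    def axes(self) -> dict:
        """Return only the axes that were actually assigned."""
        raw = {
            "origin": self.origin.value,
            "category": self.category,
            "severity": self.severity.name if self.severity else None,
            "quality_attribute": self.quality_attribute,
        }
        return {k: v for k, v in raw.items() if v is not None}


label = BugLabel(Origin.EXTRINSIC, "configuration", Severity.HIGH)
print(label.axes())
# {'origin': 'extrinsic', 'category': 'configuration', 'severity': 'HIGH'}
```

Keeping the axes as separate fields means a classifier can fill in one axis (for example, origin) while leaving another (for example, severity) unassigned.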
2. Canonical Bug Categories Across Domains
While terminology and granularity vary, core category archetypes recur in empirical taxonomies. The following table summarizes frequent top-level classes, with mapping across select studies:
| General (Catolino et al.) | NPM (InEx-Bug) | Quantum Software | Jupyter Notebooks | Video Games |
|---|---|---|---|---|
| Configuration | Intrinsic/Extrinsic | Compatibility, Syntax | Environments & Settings (ES) | Implementation Resp. |
| Performance | Intrinsic | Performance | Processing (PC) | Performance |
| Security | Intrinsic | Security | Environments & Settings (ES) | Security |
| GUI | Intrinsic | Usability | Kernel, Cell Defect (CD) | UI |
| Program/Functional | Intrinsic | Functional, Logical | Implementation (IP) | Action, AI, Bounds |
| Test Code | Not-a-Bug | Test Coverage | Not explicitly classed | Not in taxonomy |
| Network | Extrinsic | Communication | Connection (CN) | Not in taxonomy |
| Database | Intrinsic | Database | Connection/Processing | Not in taxonomy |
Domain-specific categories appear as needed: Quantum Circuit Issues (Yousuf et al., 12 Jun 2025), Kernel Bugs (Santana et al., 2022), and Bounds and Triggered Event (Truelove et al., 2021). Precise instantiations are governed by context and targeted user workflows.
3. Methodologies for Taxonomy Construction and Validation
Empirical taxonomies are developed through manual annotation, open/axial coding, triangulation with developer feedback, and automated classifier benchmarking:
- Data Sourcing: Mining of GitHub issues (NPM: 377 issues, 103 repos (Wright et al., 13 Feb 2026); Jupyter: 14,740 commits, 105 repos (Santana et al., 2022)), Stack Overflow extraction (30,416 posts, Jupyter), game patch notes (12,122 fixes, 723 patches (Truelove et al., 2021)), or domain repositories (12,910 Qiskit issues (Yousuf et al., 12 Jun 2025)).
- Labeling Protocols: Multi-rater, rubric-driven annotation (100% agreement in game taxonomy (Truelove et al., 2021)); formal agreement statistics (Krippendorff’s α up to 0.96 (Catolino et al., 2019); Cohen’s kappa: 0.826 for quantum categories, 0.818 for quality attribute (Yousuf et al., 12 Jun 2025)).
- Automated Classification: TF-IDF and label/keyword-driven models outperform embeddings (quantum: F1 up to 0.83 and accuracy 0.85 (Yousuf et al., 12 Jun 2025); Catolino et al.: F1 = 0.64 (Catolino et al., 2019)); SMOTE balancing and hyperparameter optimization are standard.
- Survey and Interview Validation: Direct developer input (Jupyter: 19 interviews (Santana et al., 2022); games: 47 respondents (Truelove et al., 2021)) substantiates category relevance, recurrence, and severity.
Taxonomies are iteratively refined as evolving codebases introduce new failure modes and as empirical inter-rater agreement guides merging or splitting of classes.
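Cohen's kappa, one of the agreement statistics reported above, corrects the raw agreement between two raters for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch of that standard formula; the function name and the toy labels are illustrative assumptions, not data from any cited study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # p_o: fraction of items on which the two raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: agreement expected if each rater labeled at random
    # according to their own label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in labels) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy annotations (hypothetical): the raters disagree on one of six issues.
a = ["perf", "perf", "gui", "gui", "sec", "perf"]
b = ["perf", "gui", "gui", "gui", "sec", "perf"]
print(round(cohen_kappa(a, b), 3))  # 0.739
```

Here the raw agreement is 5/6, but the chance-corrected value is 17/23 (about 0.739), which is why kappa rather than raw agreement is used when deciding whether to merge or split classes.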
4. Quantitative Patterns and Comparative Empirical Findings
Quantitative analyses of taxonomy categories expose defect prevalence, severity, recurrence, and ecosystem fragilities:
- Distribution and Growth: Implementation bugs dominate Jupyter commit logs (44.2% in GH; 22% in SO), while Environments/Settings dominate SO posts (43.2%) (Santana et al., 2022). In quantum software, classical bugs comprise 67.2%, with quantum-specific at 27.3% (Yousuf et al., 12 Jun 2025).
- Resolution Dynamics: In NPM, Intrinsic bugs resolve faster (median 8.9 days) than Extrinsic bugs (median 10.2 days) and more frequently require code modification (56.7% vs. 28.1%); resolution times were compared with a Mann–Whitney test (Wright et al., 13 Feb 2026).
- Severity and Recurrence: Game taxonomy identifies Crash bugs as both among the most severe (severity 38%) and the most recurrent, while Camera/Event Occurrence/Exploit are least severe (<10%) (Truelove et al., 2021). Jupyter kernel bugs, while infrequent in commits (2.9%), have high end-user impact (universal user frustration) (Santana et al., 2022).
- Root Causes and Impacts: Installation/configuration and version mismatches are predominant in Jupyter (Install & Config 32.1% SO/16.3% GH, Version 19.0% SO/22.5% GH) (Santana et al., 2022). Functional anomalies are the most frequent in open-source root-cause taxonomy (41%) (Catolino et al., 2019).
Severity metrics in quantum software show 93.7% of issues as Low, only 4.3% as Critical (Yousuf et al., 12 Jun 2025). Recurrence analysis in games relies on cosine-similarity clustering and manual vetting for true repeat bug types (Truelove et al., 2021).
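The cosine-similarity step used to surface candidate recurrences can be sketched with plain bag-of-words vectors. This is an assumption-laden illustration, not the cited study's pipeline: the tokenizer, the 0.6 threshold, and the sample patch notes are all hypothetical, and flagged pairs would still need the manual vetting described above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-count vectors of two texts."""
    va, vb = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def candidate_recurrences(notes, threshold=0.6):
    """Return (i, j, score) for note pairs at or above the threshold."""
    pairs = []
    for i in range(len(notes)):
        for j in range(i + 1, len(notes)):
            score = cosine_similarity(notes[i], notes[j])
            if score >= threshold:
                pairs.append((i, j, round(score, 3)))
    return pairs


notes = [
    "Fixed crash when opening the inventory menu",
    "Fixed a crash when opening inventory menu",
    "Camera no longer clips through walls",
]
print(candidate_recurrences(notes))  # [(0, 1, 0.857)]
```

The first two notes share six of their seven tokens, giving a similarity of 6/7; the camera note shares none and is not flagged.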
5. Classification Criteria, Key Metrics, and Automation
Formalized criteria, metrics, and automation pipelines underpin objective bug triage and analysis:
- Classification Criteria: Issue text, code change evidence (e.g., PR within 7 days for InEx-Bug (Wright et al., 13 Feb 2026)), dependency/environment context, root-cause signals from stack traces and reproduction narratives.
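- Temporal and Behavioral Metrics:
  - Median Resolution Time: the median time elapsed between an issue being opened and being closed.
  - Reopen Rate: the share of closed issues that are later reopened.
  - Recurrence Delay: the time elapsed before a closed issue recurs or is reopened (Wright et al., 13 Feb 2026).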
- Automated Tools: Rule-based NLP frameworks (quantum), logistic regression, SVM, or random forest classifiers informed by TF-IDF features attain high agreement and support for bug-type, category, and quality-attribute inference; severity assignment remains more challenging (quantum Cohen’s kappa 0.162 for severity vs. 0.712–0.826 for other axes) (Yousuf et al., 12 Jun 2025).
Process automation has recently been extended to multidimensional label assignment and is used for real-time triage and health monitoring in large projects and ecosystems.
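Two of the temporal metrics named above can be computed directly from issue-tracker timestamps. The sketch below assumes the usual readings of the metric names (resolution time as closed minus opened, reopen rate as reopened issues over closed issues); the record layout, field names, and sample issues are hypothetical.

```python
from datetime import datetime
from statistics import median

# Hypothetical issue records; "closed" is None for issues that are still open.
issues = [
    {"opened": "2024-01-01", "closed": "2024-01-05", "reopened": False},
    {"opened": "2024-01-02", "closed": "2024-01-12", "reopened": True},
    {"opened": "2024-01-03", "closed": "2024-01-10", "reopened": False},
    {"opened": "2024-01-04", "closed": None, "reopened": False},
]


def resolution_days(issue):
    """Days between the opening and the closing of a closed issue."""
    opened = datetime.fromisoformat(issue["opened"])
    closed = datetime.fromisoformat(issue["closed"])
    return (closed - opened).total_seconds() / 86400


def median_resolution_time(records):
    """Median resolution time in days over closed issues only."""
    closed = [r for r in records if r["closed"] is not None]
    return median(resolution_days(r) for r in closed)


def reopen_rate(records):
    """Fraction of closed issues that were later reopened."""
    closed = [r for r in records if r["closed"] is not None]
    return sum(r["reopened"] for r in closed) / len(closed)


print(median_resolution_time(issues))  # 7.0
print(round(reopen_rate(issues), 3))   # 0.333
```

Computing the same figures per taxonomy class (for example, Intrinsic versus Extrinsic) is what allows the resolution-time comparisons reported in the previous section.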
6. Domain-Specific Implications and Best Practices
Taxonomies inform both empirical research and practical tooling:
- Bug Triage and Assignment: Structured, automated labels (root-cause, origin) expedite assignment to domain specialists (e.g., route Extrinsic bugs upstream, fast-track security issues) (Wright et al., 13 Feb 2026, Catolino et al., 2019).
- Testing Strategies: Analysis of recurrence and severity supports prioritization (emphasize Crash/Object Persistence in games, kernel/processing bugs in Jupyter, quantum circuit issues in Qiskit) (Truelove et al., 2021, Yousuf et al., 12 Jun 2025, Santana et al., 2022).
- Ecosystem Health and Maintenance: A rising Extrinsic bug rate signals dependency/API fragility in NPM; nuanced metrics guide integration-test deployment after upgrades (Wright et al., 13 Feb 2026). In Jupyter, the prevalence of ES and Implementation bugs motivates built-in environment checkers, version management, and advanced linting (Santana et al., 2022).
- Code Quality and Process Improvements: Recurrent and severe bugs in games arise from insufficient multi-component/interaction testing, complex data dependencies, and weak telemetry (Truelove et al., 2021). Improved architectural modularity and standardized issue templates improve root-cause observability and reduce Not-a-Bug noise (Wright et al., 13 Feb 2026).
- Research Tooling Needs: Integration of root-cause prediction into triage platforms, domain-specific static analysis (configuration linters, quantum circuit verifiers), and visual diffs for literate notebook environments are prioritized advancements (Santana et al., 2022, Yousuf et al., 12 Jun 2025, Catolino et al., 2019).
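The triage practices listed above (routing Extrinsic bugs upstream, fast-tracking security issues) amount to simple rules over taxonomy labels. The following is a minimal sketch under stated assumptions: the queue names, the rule order, and the `route` function are hypothetical and do not come from any cited tool.

```python
def route(origin: str, category: str) -> str:
    """Map taxonomy labels to a (hypothetical) triage queue."""
    # Security issues are fast-tracked regardless of where the defect originates.
    if category == "security":
        return "security-fast-track"
    # Extrinsic breakage is forwarded to the upstream dependency.
    if origin == "extrinsic":
        return "upstream-dependency"
    # Everything else stays with the project's own maintainers.
    return "project-maintainers"


print(route("extrinsic", "configuration"))  # upstream-dependency
print(route("extrinsic", "security"))       # security-fast-track
print(route("intrinsic", "performance"))    # project-maintainers
```

Checking the security rule first is a design choice in this sketch: it keeps a security issue from being deferred upstream simply because its origin is Extrinsic.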
7. Comparative Limitations and Future Extensions
Limitations in current taxonomies motivate ongoing research:
- Coverage and Generalizability: Some taxonomies derive from open-source, public projects and may not extend directly to closed-source or specialized industrial contexts (Catolino et al., 2019).
- Feature Scope: Automated classifiers limited to text summaries underperform on subtle categories; inclusion of comments, tracebacks, patches, and code metrics is advocated (Catolino et al., 2019, Yousuf et al., 12 Jun 2025).
- Taxonomic Refinement: High-volume categories often mask distinct error modes (e.g., “Program Anomaly” or “Implementation”), suggesting need for finer-grained decomposition over time (Catolino et al., 2019, Santana et al., 2022).
- Domain Evolution: Emerging technologies (quantum software, notebooks-in-production) and shifting development paradigms (microservices, multiparadigm environments) necessitate regular taxonomy revision.
Continued triangulation among empirical mining, user studies, and automated inferential tools is fundamental to sustaining the relevance and efficacy of bug taxonomies for software engineering research and practice.