
An Empirical Study on Software Defect Prediction with a Simplified Metric Set (1402.3873v4)

Published 17 Feb 2014 in cs.SE

Abstract: Software defect prediction plays a crucial role in estimating the most defect-prone components of software, and a large number of studies have pursued improving prediction accuracy within a project or across projects. However, the rules for making an appropriate decision between within- and cross-project defect prediction when available historical data are insufficient remain unclear. The objective of this work is to validate the feasibility of the predictor built with a simplified metric set for software defect prediction in different scenarios, and to investigate practical guidelines for the choice of training data, classifier and metric subset of a given project. First, based on six typical classifiers, we constructed three types of predictors using the size of software metric set in three scenarios. Then, we validated the acceptable performance of the predictor based on Top-k metrics in terms of statistical methods. Finally, we attempted to minimize the Top-k metric subset by removing redundant metrics, and we tested the stability of such a minimum metric subset with one-way ANOVA tests. The experimental results indicate that (1) the choice of training data should depend on the specific requirement of prediction accuracy; (2) the predictor built with a simplified metric set works well and is very useful in case limited resources are supplied; (3) simple classifiers (e.g., Naive Bayes) also tend to perform well when using a simplified metric set for defect prediction; and (4) in several cases, the minimum metric subset can be identified to facilitate the procedure of general defect prediction with acceptable loss of prediction precision in practice. The guideline for choosing a suitable simplified metric set in different scenarios is presented in Table 12.

Authors (5)
  1. Peng He (63 papers)
  2. Bing Li (374 papers)
  3. Xiao Liu (402 papers)
  4. Jun Chen (374 papers)
  5. Yutao Ma (37 papers)
Citations (274)

Summary

An Empirical Study on Software Defect Prediction with a Simplified Metric Set

The paper "An Empirical Study on Software Defect Prediction with a Simplified Metric Set" focuses on the feasibility and effectiveness of using simplified metric sets in software defect prediction, with an emphasis on both within-project defect prediction (WPDP) and cross-project defect prediction (CPDP). The research hinges on several pivotal questions regarding the selection of training data, metric simplification, and classifier choice—each critically examined to optimize predictive accuracy while minimizing computational overhead.

Research Motivation and Methodology

In software engineering, defect prediction models are vital to identify the most defect-prone components, thereby allowing strategic allocation of resources for testing and maintenance. This study aims to validate whether a simplified approach using a reduced metric set can maintain prediction accuracy when historical data is insufficient or resource constraints are present. The paper proposes a simplified metric identification approach that focuses on the Top-k metrics frequently appearing in prediction models, tested across 34 releases of 10 open-source projects sourced from the PROMISE repository.
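The Top-k idea can be sketched as a simple frequency count over the metrics selected by each per-release prediction model. The sketch below is illustrative only: the helper name and the input data are hypothetical, and the metric names (CBO, LOC, LCOM, etc.) are just examples drawn from common object-oriented metric suites.

```python
from collections import Counter

def top_k_metrics(selected_per_model, k=5):
    """Rank metrics by how often they appear across per-release models
    and keep the k most frequent ones (the Top-k notion from the paper)."""
    counts = Counter(m for metrics in selected_per_model for m in metrics)
    return [metric for metric, _ in counts.most_common(k)]

# Hypothetical metric selections from three release-level models.
selections = [
    ["CBO", "LOC", "LCOM", "RFC"],
    ["CBO", "LOC", "WMC"],
    ["LOC", "LCOM", "CBO", "DIT"],
]
print(top_k_metrics(selections, k=3))  # the three most frequent metrics
```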

The researchers constructed predictive models with six typical classifiers (J48, Logistic Regression, Naïve Bayes, Decision Table, Support Vector Machine (SVM), and Bayesian Network) and compared their performance across three training-data scenarios:

  1. WPDP using the nearest historical release.
  2. WPDP using all available historical releases.
  3. CPDP using the most suitable releases from other projects.
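The three training-data scenarios amount to different ways of assembling a training set from the available releases. A minimal sketch, assuming a hypothetical `choose_training_data` helper and treating each release dataset as an opaque value:

```python
def choose_training_data(project, release_idx, all_projects, scenario):
    """Assemble training data for the three scenarios described above.
    `all_projects` maps a project name to its ordered list of releases;
    `release_idx` is the index of the release being predicted."""
    releases = all_projects[project]
    if scenario == "WPDP-nearest":
        # Within-project: only the nearest historical release.
        return [releases[release_idx - 1]]
    if scenario == "WPDP-all":
        # Within-project: all historical releases before the target.
        return releases[:release_idx]
    if scenario == "CPDP":
        # Cross-project: releases drawn from the other projects.
        return [r for name, rs in all_projects.items()
                if name != project for r in rs]
    raise ValueError(f"unknown scenario: {scenario}")

# Hypothetical repository with two projects and their releases.
projects = {"ant": ["ant-1", "ant-2", "ant-3"], "camel": ["camel-1", "camel-2"]}
print(choose_training_data("ant", 2, projects, "WPDP-all"))
```

Note that the paper selects the *most suitable* cross-project releases rather than all of them; the CPDP branch here simply pools every candidate, leaving the suitability ranking out of scope.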

Key Findings

  1. Training Data Selection:
    • CPDP can yield comparable, and often superior, Recall and F-measure relative to WPDP, particularly when project-specific training data is limited. This suggests an advantage in leveraging external data sources in such contexts.
  2. Simplified Metric Sets:
    • The study finds that defect prediction models constructed with a simplified set of Top-5 frequently occurring metrics demonstrate acceptable performance compared to those using the full set of metrics. This reduction significantly lowers data acquisition and processing costs.
  3. Classifier Performance:
    • Simple classifiers, notably Naïve Bayes, performed exceptionally well with the simplified metric sets, suggesting that less complex models can maintain a good balance between precision and recall.
  4. Minimum Metric Subset:
    • A further reduced metric set, such as CBO+LOC+LCOM, presented stable and comparable results across different scenarios and classifier choices, indicating its potential as a universal baseline for defect prediction with general applicability.
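The stability claim rests on one-way ANOVA: if the F-statistic over the minimum-subset predictor's performance across scenarios is small, the scenario means do not differ significantly and the subset can be considered stable. The F-statistic itself is easy to compute by hand; the performance values below are hypothetical, not results from the paper.

```python
def one_way_anova_F(groups):
    """One-way ANOVA F-statistic: between-group mean square divided by
    within-group mean square, over a list of sample groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ssb / (k - 1)) / (ssw / (n - k))

# Hypothetical F-measures of a CBO+LOC+LCOM predictor in three scenarios.
f_measures = [[0.60, 0.62, 0.61], [0.59, 0.63, 0.60], [0.61, 0.60, 0.62]]
print(one_way_anova_F(f_measures))  # a small F suggests a stable subset
```

In practice one would compare F against the critical value for the chosen significance level (or use `scipy.stats.f_oneway`, which also returns the p-value).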

Implications and Future Work

The implications of this research are significant, offering a viable pathway to reduce the complexity of defect prediction models without significantly compromising performance. This has practical utility for software projects constrained by limited resources or historical data. Additionally, the findings support the development of cost-effective and generalized defect prediction frameworks. Future work could validate the findings across a broader range of programming languages and software domains, and incorporate more complex dynamic metrics or process measures. Furthermore, exploring the integration of these simplified models into real-time deployment pipelines could extend the practical applications of AI in software engineering tasks.