- The paper introduces XTREE as a primary change oracle that leverages historical defect data to identify minimal changes in code metrics.
- XTREE employs iterative dichotomization and decision trees to discern targeted metric adjustments for reducing defect rates.
- Evaluation across five Java datasets demonstrates that XTREE recommends fewer, more stable changes while effectively lowering predicted defect proneness.
Developers often use the concept of "bad smells" – surface indications of deeper code problems like large classes or long methods – to guide code reorganization efforts. While tools and literature exist to identify these smells, there's significant disagreement on which bad smells are truly important and worth fixing. Furthermore, arbitrarily fixing all detected bad smells is costly and doesn't guarantee improvement in code quality, particularly defect proneness. This paper introduces XTREE (Less is More: Minimizing Code Reorganization using XTREE, 2016), a framework designed to address this challenge by using historical defect data to identify the minimal set of code metric changes that are likely to reduce defects, thereby helping developers prioritize or even ignore proposed code reorganizations that lack historical evidence of being effective.
The core idea behind XTREE is to avoid the "conjunctive fallacy," the common heuristic that code quality improves by simply reducing all code metrics that exceed certain thresholds. This approach ignores the complex interdependencies between metrics, where decreasing one metric (like Lines of Code) might necessitate increasing another (like coupling) if functionality is moved elsewhere. XTREE aims to find conjunctions of changes, understanding that sometimes increasing certain metrics can be beneficial.
XTREE operates as a "primary change oracle" and works in conjunction with a "secondary verification oracle."
XTREE: The Primary Change Oracle
The XTREE algorithm analyzes historical data containing code metrics and defect counts for software modules (e.g., classes or files). Its goal is to find a minimal set of code metric changes (Δ) that can move a module from a "defective" state to a "less defective" state, based on past project history.
- Discretization and Tree Building: XTREE uses an iterative dichotomization process to discretize the continuous code metrics into a small number of ranges. It then builds a decision tree where internal nodes represent metrics and branches represent their discretized ranges. This process effectively clusters code modules with similar metric values and defect counts. The metric chosen at each split maximizes the reduction in defect variance across the resulting ranges.
- Identifying Change Deltas (Δ): For a given code module identified as potentially defective (by its placement in a leaf node C+ with a high defect percentage), XTREE seeks a "better" neighboring leaf node C− in the decision tree. A "better" node is one accessible from a nearby branch in the tree (specifically, a sibling node reachable from an ancestor node within a few levels) that has a significantly lower defect rate (e.g., ≤50% of the defect rate of C+).
- Computing the Minimal Change Set: The Δ (set of recommended changes) is derived from the differences in the metric ranges along the path from the defective node C+ to the better sibling node C−. For discrete attributes, the delta is the desired value. For numeric attributes represented as ranges (LOW,HIGH], the delta is any value within that target range. This process identifies a minimal set of attributes (those on the branching path difference) that, if changed appropriately, could move the module towards a less defective cluster.
XTREE prioritizes finding the nearest "better" sibling in the tree, rather than exhaustively searching for the absolute best one. This design choice lets XTREE produce recommendations quickly and, importantly, keeps the number of attributes it recommends changing small.
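To make the tree-building and delta-extraction steps concrete, here is a minimal sketch in Python. It is not the paper's implementation: scikit-learn's DecisionTreeRegressor (which splits on variance reduction over the defect counts) stands in for XTREE's iterative dichotomization, and the helper names (fit_change_oracle, recommend_delta) are invented for illustration.

```python
from sklearn.tree import DecisionTreeRegressor

def fit_change_oracle(metrics, defects, max_depth=4, min_leaf=10):
    """Group modules into leaves with similar metric ranges and defect counts."""
    tree = DecisionTreeRegressor(max_depth=max_depth, min_samples_leaf=min_leaf)
    tree.fit(metrics, defects)
    return tree

def _path_to_root(tree, leaf):
    """Node ids from a leaf up to the root."""
    t, parent = tree.tree_, {}
    for node in range(t.node_count):
        for child in (t.children_left[node], t.children_right[node]):
            if child != -1:
                parent[child] = node
    path = [leaf]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def _leaves_under(tree, node):
    """All leaf ids in the subtree rooted at `node`."""
    t = tree.tree_
    if t.children_left[node] == -1:
        return [node]
    return _leaves_under(tree, t.children_left[node]) + _leaves_under(tree, t.children_right[node])

def recommend_delta(tree, x, feature_names, max_levels=3, gain=0.5):
    """Find a nearby leaf whose mean defect count is <= gain * the current leaf's,
    then report which branch conditions on the way there the module x violates."""
    t = tree.tree_
    current_leaf = tree.apply(x.reshape(1, -1))[0]
    current_rate = t.value[current_leaf][0][0]
    for ancestor in _path_to_root(tree, current_leaf)[1:max_levels + 1]:
        for leaf in _leaves_under(tree, ancestor):
            if leaf != current_leaf and t.value[leaf][0][0] <= gain * current_rate:
                return _delta_along_path(tree, leaf, x, feature_names)
    return []  # no sufficiently better neighbourhood in the tree

def _delta_along_path(tree, target_leaf, x, feature_names):
    """Delta = the conditions on the target leaf's decision path that x does not yet satisfy."""
    t = tree.tree_
    path = list(reversed(_path_to_root(tree, target_leaf)))  # root -> target leaf
    delta = []
    for node, nxt in zip(path[:-1], path[1:]):
        feat, thr = t.feature[node], t.threshold[node]
        wants_left = nxt == t.children_left[node]
        satisfied = x[feat] <= thr if wants_left else x[feat] > thr
        if not satisfied:
            move = "decrease to <= %.2f" % thr if wants_left else "increase to > %.2f" % thr
            delta.append((feature_names[feat], move))
    return delta
```

Calling recommend_delta on a test module's metric vector returns a short list of (metric, direction) pairs, mirroring the small, possibly direction-mixed Δ that XTREE produces.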
The Secondary Verification Oracle
To evaluate whether the changes proposed by a primary change oracle (like XTREE, CD, Shatnawi, or Alves) are effective in reducing defect proneness, the framework uses a separate "secondary verification oracle." This oracle is trained on historical data to predict the defect proneness of a code module based on its metrics. It is trained independently of the primary oracle to provide an unbiased assessment.
Unlike previous approaches that used generic, potentially outdated quality models like QMOOD, XTREE employs Random Forests [Breiman2001] trained specifically on the project's own historical defect data. To improve the accuracy of the Random Forest predictor when one class is underrepresented (class imbalance), the paper incorporates SMOTE (Synthetic Minority Over-sampling Technique) [chawla2002smote] and tunes the learner's parameters with differential evolution [storn97], optimizing the probability of detection (recall, pd) against the probability of false alarm (pf). Only datasets where the tuned Random Forest achieved adequate prediction performance (pd ≥ 60% and pf ≤ 30%) were used for evaluation.
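The sketch below shows how such a verification oracle might be assembled from off-the-shelf components; the synthetic dataset, the two tuned hyper-parameters, and the single-number objective (pd minus pf) are illustrative assumptions rather than the paper's exact tuning setup.

```python
from imblearn.over_sampling import SMOTE
from scipy.optimize import differential_evolution
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a project's historical metric/defect table.
X, y = make_classification(n_samples=600, n_features=20, weights=[0.8, 0.2], random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=1)
X_train, y_train = SMOTE(random_state=1).fit_resample(X_train, y_train)  # rebalance the minority class

def objective(params):
    """Score a candidate (n_estimators, max_depth); lower is better for the optimiser."""
    n_estimators, max_depth = int(params[0]), int(params[1])
    rf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=1)
    rf.fit(X_train, y_train)
    tn, fp, fn, tp = confusion_matrix(y_val, rf.predict(X_val)).ravel()
    pd = tp / (tp + fn)   # probability of detection (recall)
    pf = fp / (fp + tn)   # probability of false alarm
    return -(pd - pf)     # reward high recall, penalise false alarms

best = differential_evolution(objective, bounds=[(10, 150), (2, 15)], maxiter=5, seed=1)
print("tuned: n_estimators=%d, max_depth=%d" % (int(best.x[0]), int(best.x[1])))
```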
The effectiveness of a primary change oracle is measured by taking test code modules, applying the recommended metric changes (simulating the change in metric values according to the oracle's Δ), and then using the secondary verification oracle to predict the number of defects in the "changed" module (d−). This is compared to the predicted defects in the original module (d+), and the improvement is calculated as 100 × (1 − d−/d+).
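In code, the improvement score is a one-liner; the numbers below are invented purely to show the arithmetic.

```python
def improvement(d_plus, d_minus):
    """Percentage reduction in predicted defects after applying the recommended changes."""
    return 100.0 * (1.0 - d_minus / d_plus)

print(improvement(d_plus=8, d_minus=2))  # -> 75.0
```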
Practical Evaluation and Results
The paper evaluates XTREE against three other methods for recommending changes: outlier statistics methods by Shatnawi [Shatnawi10] and Alves et al. [Alves2010], and the cluster delta method CD [me12c]. The evaluation used five Java datasets (Ant, Ivy, Lucene, Jedit, Poi) from the Jureczko et al. collection [jureczko10]. The methods were trained on older versions and tested on a newer version of each project. Multiple runs were conducted due to the stochastic nature of some methods. The Scott-Knott test, augmented with effect size (A12) and bootstrapping, was used to statistically compare the performance of the methods.
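For readers unfamiliar with the effect-size component of that comparison, here is a small sketch of the Vargha-Delaney A12 statistic (the Scott-Knott ranking and bootstrapping are omitted); the sample values are made up.

```python
def a12(xs, ys):
    """Vargha-Delaney A12: probability that a value from xs beats one from ys (ties count half)."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
    return wins / (len(xs) * len(ys))

print(a12([70, 80, 75], [40, 55, 60]))  # -> 1.0: the first treatment always wins
```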
The key findings (answering the research questions):
- Effectiveness (RQ1): Across the datasets, XTREE achieved the highest predicted defect reduction, or was tied for the highest, compared to the other methods (Figure 6 in the paper).
- Succinctness (RQ2): XTREE recommended changes to far fewer code metrics than the other methods (Figure 7). While other methods often recommended changing nearly all metrics, XTREE focused on a small subset.
- Stopping Power (RQ3): XTREE's recommendations were very sparse. For the studied projects, XTREE recommended changes to only 1-4 of the 20 static code attributes, giving clear guidance that reorganizations targeting the remaining 16-19 attributes lack historical evidence of benefit and can be deprecated.
- Stability (RQ4): The direction of the recommended changes (increase or decrease) by XTREE was highly stable across repeated runs (Figure 8).
- Conjunctive Fallacy (RQ5): Confirming intuition and contradicting the naive "reduce all outliers" approach, XTREE often recommended increasing certain static code attributes while decreasing others like Lines of Code. For example, for the 'Poi' dataset, XTREE suggested decreasing LOC but increasing Coupling Between Objects (CBO), reflecting that breaking up large classes necessitates increasing coupling to other modules. This highlights that the optimal change strategy is project-specific and often involves trade-offs.
Implementation and Application
Implementing XTREE involves:
- Data Collection: Gather historical data for your software project, including relevant code metrics for each module and the number (or presence/absence) of defects found in those modules post-release. The paper uses standard OO metrics (LOC, WMC, CBO, etc.).
- Building the Verification Oracle:
- Use historical data (excluding the latest version for testing) to train a defect prediction model, such as a Random Forest classifier.
- Address class imbalance (if defect-prone modules are a minority) using techniques like SMOTE.
- Tune the classifier parameters using an optimization algorithm (like differential evolution) to improve its predictive performance (maximizing recall while minimizing false alarms).
- Building the Primary Change Oracle (XTREE):
- Implement the iterative dichotomization algorithm to discretize code metrics based on defect variance.
- Build a decision tree using the discretized metrics on the training data.
- For each module in the test set: pass the module down the tree to its leaf node, then find a "better" sibling leaf (one with a lower defect percentage) within a small tree-level distance.
- Identify the metric ranges that define the path difference between the current (defective) leaf and the target (less defective) sibling. These differences constitute the recommended changes Δ.
- Applying and Evaluating Changes (a minimal end-to-end sketch follows this list):
- For each recommended change Δ for a test module, simulate updating the module's metric values to be within the target range of the delta.
- Use the trained secondary verification oracle to predict the defect proneness of the module after the simulated changes.
- Compare this prediction to the oracle's prediction for the module before the changes to quantify the potential improvement.
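A hedged end-to-end sketch of this apply-and-evaluate loop is shown below. It assumes a trained classifier (verifier, e.g. the tuned Random Forest above) whose defective-class probability stands in for the predicted defect count, and a change-oracle callable such as a partial application of the hypothetical recommend_delta sketched earlier; the 10% nudge used to simulate a change is a deliberate simplification of the paper's move-into-the-target-range step.

```python
import numpy as np

def simulate_change(x, delta, feature_index, step=0.9):
    """Crudely apply a recommended delta: nudge each named metric 10% in the stated direction."""
    x_new = x.copy()
    for feature, direction in delta:
        i = feature_index[feature]
        x_new[i] = x_new[i] * step if direction.startswith("decrease") else x_new[i] / step
    return x_new

def evaluate_oracle(verifier, recommend, X_test, feature_names):
    """Mean improvement score over test modules for which the change oracle found a delta."""
    feature_index = {f: i for i, f in enumerate(feature_names)}
    scores = []
    for x in X_test:
        delta = recommend(x)
        if not delta:
            continue  # no historically better neighbourhood, so no recommendation
        d_plus = verifier.predict_proba(x.reshape(1, -1))[0, 1]
        d_minus = verifier.predict_proba(simulate_change(x, delta, feature_index).reshape(1, -1))[0, 1]
        scores.append(100.0 * (1.0 - d_minus / max(d_plus, 1e-6)))
    return float(np.mean(scores)) if scores else 0.0
```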
Implementation Considerations:
- Data Availability: XTREE and the verification oracle rely heavily on historical project data. This might be a limitation for new projects or projects without robust bug tracking and metric collection.
- Metric Selection: The chosen metrics should be relevant to defect proneness.
- Verification Oracle Accuracy: The reliability of the recommendations depends on the accuracy of the secondary oracle in predicting defect proneness based on metrics.
- Simulating Changes: The evaluation simulates metric changes. Real-world code reorganization is complex, and achieving the exact target metric values while preserving functionality might be challenging.
- Computational Cost: Training Random Forests and building/traversing decision trees are computationally feasible, but the tuning process can add overhead. Scaling to very large codebases might require optimization.
- Deployment: XTREE could be integrated into CI/CD pipelines or static analysis tools to provide automated, evidence-based recommendations for code reorganization tasks reported by traditional smell detectors.
Practical Takeaways for Practitioners:
- Don't blindly trust general bad smell definitions or thresholds from textbooks or tools. Code smells are project-specific.
- Leverage your project's historical data (code metrics and defects) to identify which potential "smells" are actually correlated with defects in your codebase.
- Use a data-driven approach like XTREE to identify specific metrics that, if changed, are likely to reduce defect proneness.
- Be aware that improving code quality might involve increasing some metrics (like coupling) while decreasing others, contrary to simplistic outlier-based rules.
- Focus reorganization efforts on the small subset of metrics identified as impactful by a tool like XTREE, ignoring the rest. This minimizes wasted effort.
In conclusion, XTREE provides a practical, evidence-based framework that goes beyond generic bad smell detection: it offers project-specific, actionable recommendations for code reorganization, prioritizing changes historically associated with reduced defects and explicitly deprecating those that are not.