Traceability in the Wild: Automatically Augmenting Incomplete Trace Links (1804.02433v1)

Published 6 Apr 2018 in cs.SE

Abstract: Software and systems traceability is widely accepted as an essential element for supporting many software development tasks. Today's version control systems provide inbuilt features that allow developers to tag each commit with one or more issue IDs, thereby providing the building blocks from which project-wide traceability can be established between feature requests, bug fixes, commits, source code, and specific developers. However, our analysis of six open source projects showed that on average only 60% of the commits were linked to specific issues. Without these fundamental links the entire set of project-wide links will be incomplete, and therefore not trustworthy. In this paper we address the fundamental problem of missing links between commits and issues. Our approach leverages a combination of process and text-related features characterizing issues and code changes to train a classifier to identify missing issue tags in commit messages, thereby generating the missing links. We conducted a series of experiments to evaluate our approach against six open source projects and showed that it was able to effectively recommend links for tagging issues at an average of 96% recall and 33% precision. In a related task for augmenting a set of existing trace links, the classifier returned precision at levels greater than 89% in all projects and recall of 50%.

Summary

The paper proposes a supervised classifier to automatically detect and augment missing commit-issue trace links in open-source projects.
It employs a composite feature set of process attributes and textual similarity, achieving 96% recall (at 33% precision) as a recommender and precision above 89% (at 50% recall) when augmenting existing links.
The study demonstrates practical improvements in traceability, enabling more effective safety analysis and change impact analysis across software projects.

An Analytical Overview of "Traceability in the Wild: Automatically Augmenting Incomplete Trace Links"

The paper "Traceability in the Wild: Automatically Augmenting Incomplete Trace Links" provides a thorough examination of software and systems traceability, concentrating on the typically incomplete links between software commits and issue reports in open-source projects. The authors seek to address the gap in traceability by developing a supervised classifier to automatically detect and augment missing links between these entities.

The authors begin by identifying the problem: a significant number of commits in version control systems lack explicit tags linking them to specific issues. Analyzing six large open-source projects, they found that only about 60% of commits were linked to issues. This lack of traceability can undermine several critical software engineering activities, such as safety analysis and change impact analysis.

To tackle the challenge of missing trace links, the authors propose a machine learning-based approach. They designed a classifier whose features fall into two categories: process-related attributes and textual similarity. Together these features characterize each commit-issue pair, covering stakeholder-related information, temporal relations, structural attributes, and textual similarity computed with information retrieval techniques such as VSM-nGram.
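The feature categories described above can be illustrated with a minimal sketch. It is not the authors' implementation: the dictionary keys, the feature names, and the choice of raw unigram and bigram counts with cosine similarity (a simplified stand-in for VSM-nGram, without term weighting) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from datetime import datetime


def ngram_counts(text, max_n=2):
    """Count word n-grams (unigrams up to max_n-grams) in lowercased text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_features(commit, issue):
    """Build one feature row for a commit-issue pair.

    All field and feature names here are hypothetical.
    """
    return {
        # Stakeholder-related: did the same person commit and own the issue?
        "same_person": float(commit["author"] == issue["assignee"]),
        # Temporal: was the commit made while the issue was open?
        "committed_while_open": float(
            issue["created"] <= commit["time"] <= issue["resolved"]
        ),
        # Temporal: hours between the commit and the issue's resolution.
        "hours_before_resolution": (
            issue["resolved"] - commit["time"]
        ).total_seconds() / 3600.0,
        # Textual: n-gram cosine similarity of commit message and issue summary.
        "text_similarity": cosine_similarity(
            ngram_counts(commit["message"]), ngram_counts(issue["summary"])
        ),
    }


# Made-up example data.
commit = {
    "author": "alice",
    "time": datetime(2018, 3, 1, 12, 0),
    "message": "Fix null pointer in parser cache",
}
issue = {
    "assignee": "alice",
    "created": datetime(2018, 2, 27, 9, 0),
    "resolved": datetime(2018, 3, 1, 18, 0),
    "summary": "Null pointer exception in parser cache",
}
features = pair_features(commit, issue)
```

Each candidate pair becomes one numeric row like `features`, which is the form a classifier would consume. Structural attributes are omitted here because the overview does not describe them in enough detail to sketch.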

The classifier is trained on this composite feature set to predict whether a given commit and issue should be linked. The paper employs a Random Forest as the primary classifier, given its robustness and effectiveness in similar domains, and trains it on data from the same six open-source projects under examination.

The evaluation is carried out in two main scenarios. Firstly, the classifier is employed as a recommender system to assist developers by suggesting likely issue links at the time of committing changes. The authors report a high recall of 96% and a moderate precision of 33%, indicating that the system can effectively provide developers with relevant suggestions, potentially reducing the traceability gaps as they occur.
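To see why high recall and moderate precision can coexist in a recommender setting, consider a small worked sketch. The issue IDs, scores, and the choice of showing the top three suggestions are made up for illustration and are not taken from the paper.

```python
def recommend(scores, k=3):
    """Return the k issue IDs with the highest classifier scores."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [issue_id for issue_id, _ in ranked[:k]]


def precision_recall(recommended, relevant):
    """Precision and recall of a recommendation list against the true links."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical classifier scores for one commit against four open issues.
scores = {
    "ISSUE-101": 0.91,
    "ISSUE-087": 0.62,
    "ISSUE-140": 0.55,
    "ISSUE-012": 0.08,
}
suggestions = recommend(scores, k=3)

# Assume the commit truly belongs to ISSUE-101 only.
precision, recall = precision_recall(suggestions, {"ISSUE-101"})
```

In this example the correct issue is among the three suggestions, so recall is 1.0, while only one of three suggestions is correct, so precision is 1/3. A developer picking from a short list tolerates the extra candidates, which is why recall matters more than precision in this scenario.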

In the second scenario, the classifier is used to augment existing trace links automatically. Here, precision becomes the critical factor, since incorrect links would undermine the credibility of the trace data. The authors report precision greater than 89% in all projects with recall of 50%, showing that the system can add links reliably at the cost of leaving some missing links unrecovered.
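One common way to trade recall for precision in a fully automated setting is to accept only links whose classifier score exceeds a high threshold. The sketch below illustrates that idea under stated assumptions: the scores, labels, and threshold values are invented, and the overview does not say that the authors tuned a threshold in exactly this way.

```python
def precision_recall_at(scored_pairs, threshold):
    """Precision and recall when accepting every pair scored >= threshold.

    scored_pairs is a list of (score, is_true_link) tuples.
    """
    accepted = [is_link for score, is_link in scored_pairs if score >= threshold]
    total_true = sum(is_link for _, is_link in scored_pairs)
    true_positives = sum(accepted)
    precision = true_positives / len(accepted) if accepted else 0.0
    recall = true_positives / total_true if total_true else 0.0
    return precision, recall


# Hypothetical classifier scores with ground-truth labels.
scored_pairs = [
    (0.97, True), (0.93, True), (0.88, True), (0.81, False), (0.74, True),
    (0.66, False), (0.58, False), (0.52, True), (0.30, False), (0.12, False),
]

lenient = precision_recall_at(scored_pairs, threshold=0.5)
strict = precision_recall_at(scored_pairs, threshold=0.9)
```

With the lenient threshold all five true links are accepted along with three false ones (precision 0.625, recall 1.0). With the strict threshold only two links are accepted and both are correct (precision 1.0, recall 0.4), which mirrors the high-precision, lower-recall profile described for the augmentation task.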

The authors' approach is validated with a manual inspection of links proposed for previously unlinked commits, confirming the classifier's practical applicability in real-world scenarios. However, certain limitations such as variability in data validation and domain dependence were noted, acknowledging that the approach, while effective within the studied environments, requires adaptation for different development settings or systems.

The implications of this research are significant in the context of software engineering. Practically, it offers a method to increase project traceability without imposing additional burdens on developers, using a minimal interface that suggests potential links at the point of commit. Theoretically, it adds to the discourse on automated traceability, proposing a viable fusion of process understanding and textual analysis via machine learning.

Future developments could focus on expanding this approach to accommodate different software development methodologies or extend the classifier's functionality to cover more types of artifacts and relationships, further closing the gap in project-wide traceability.

In conclusion, the paper presents an empirically validated approach to augmenting trace links in software projects, showing that missing commit-issue links can be recovered both as recommendations at commit time and as high-precision automated additions to an existing link set.
