
Methods2Test: A dataset of focal methods mapped to test cases (2203.12776v1)

Published 23 Mar 2022 in cs.SE

Abstract: Unit testing is an essential part of the software development process, which helps to identify issues with source code in early stages of development and prevent regressions. Machine learning has emerged as a viable approach to help software developers generate automated unit tests. However, generating reliable unit test cases that are semantically correct and capable of catching software bugs or unintended behavior via machine learning requires large, metadata-rich datasets. In this paper we present Methods2Test, a large, supervised dataset of test cases mapped to corresponding methods under test (i.e., focal methods). This dataset contains 780,944 pairs of JUnit tests and focal methods, extracted from a total of 91,385 Java open source projects hosted on GitHub with licenses permitting re-distribution. The main challenge behind the creation of Methods2Test was to establish a reliable mapping between a test case and the relevant focal method. To this aim, we designed a set of heuristics, based on developers' best practices in software testing, which identify the likely focal method for a given test case. To facilitate further analysis, we store a rich set of metadata for each method-test pair in JSON-formatted files. Additionally, we extract a textual corpus from the dataset at different context levels, which we provide in both raw and tokenized forms, in order to enable researchers to train and evaluate machine learning models for Automated Test Generation. Methods2Test is publicly available at: https://github.com/microsoft/methods2test

Authors (4)
  1. Michele Tufano (28 papers)
  2. Shao Kun Deng (5 papers)
  3. Neel Sundaresan (38 papers)
  4. Alexey Svyatkovskiy (30 papers)
Citations (20)

Summary

An Expert Overview of Methods2Test: A Dataset of Focal Methods Mapped to Test Cases

The manuscript "Methods2Test: A Dataset of Focal Methods Mapped to Test Cases" presents an extensive effort to bridge the gap in resources available for ML research focused on automated unit test generation. While ML techniques have recently been applied across many domains of software engineering, automated test generation has lagged behind, primarily due to the lack of large, high-quality datasets. This paper addresses that obstacle by introducing a substantial dataset comprising over 780,000 JUnit test cases paired with their corresponding focal methods, extracted from 91,385 Java projects hosted on GitHub.

Contribution and Methodology

The contribution of this work is multifaceted. Primarily, it introduces a comprehensive dataset with explicit mappings between test cases and the methods under test. The dataset is curated using a set of heuristics based on best practices in software testing, aimed at reliably identifying the focal method for a given test case. These mappings matter because they supply the supervision needed to train ML models that generate semantically correct and useful unit tests.

The dataset is enriched with a wealth of metadata accompanying each method-test pair, preserved in JSON format, thereby enabling in-depth analysis and utility beyond mere test generation. The authors detail the methodology employed to assemble this dataset, which includes parsing project repositories to extract class and method metadata, identifying test classes and focal classes, and mapping the test cases to the focal methods through sophisticated heuristics like path matching and name similarity.
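The path-matching and name-similarity heuristics follow common Java testing conventions (e.g., `FooTest` tests `Foo`, `testParseDate` tests `parseDate`). The sketch below illustrates that idea only; the function names are hypothetical and the paper's actual heuristics combine several additional signals, such as matching file paths and method invocations inside the test body.

```python
import re

def find_focal_class(test_class_name, project_classes):
    """Match a test class (e.g. 'FooTest') to a candidate focal class by
    stripping a 'Test' prefix or suffix, a common JUnit naming convention."""
    candidate = re.sub(r"^Test|Test$", "", test_class_name)
    return candidate if candidate in project_classes else None

def find_focal_method(test_method_name, focal_class_methods):
    """Match a test method (e.g. 'testParseDate') to a focal method by
    case-insensitive name similarity after removing the 'test' prefix."""
    stripped = re.sub(r"^test_?", "", test_method_name, flags=re.IGNORECASE)
    for method in focal_class_methods:
        if method.lower() == stripped.lower():
            return method
    return None
```

For example, `find_focal_class("DateParserTest", {"DateParser", "Main"})` resolves to `DateParser`, and `find_focal_method("testParseDate", ["parseDate", "format"])` resolves to `parseDate`.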

Dataset Composition and Structure

The dataset structure thoughtfully incorporates different contextual levels around the focal methods, from the direct method code to broader class context including constructors, method signatures, and class fields. This nuanced representation is critical for ML models to potentially leverage additional context that might inform better test case generation.
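As a rough illustration of these nested context levels, one could assemble a model input from a pair's metadata as follows. The field names, level numbering, and example pair here are assumptions for illustration, not the exact schema of the released JSON files.

```python
def build_input(pair, level):
    """Assemble a textual model input at increasing context levels:
    level 0 = focal method only; level 1 adds the enclosing class with
    constructor and method signatures; level 2 additionally adds fields."""
    focal_method = pair["focal_method"]
    focal_class = pair["focal_class"]
    if level == 0:
        return focal_method
    parts = [f'class {focal_class["name"]} {{', focal_method]
    parts += focal_class["constructors"] + focal_class["method_signatures"]
    if level >= 2:
        parts += focal_class["fields"]
    return "\n".join(parts + ["}"])

# Hypothetical pair, mimicking the kind of metadata the dataset stores.
example_pair = {
    "focal_method": "public int add(int a, int b) { return a + b; }",
    "focal_class": {
        "name": "Calculator",
        "constructors": ["public Calculator();"],
        "method_signatures": ["public int sub(int a, int b);"],
        "fields": ["private int memory;"],
    },
}
```

Richer levels give a model visibility into the class's API surface, at the cost of longer inputs, so the appropriate level is a modeling trade-off.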

The data is segmented into training, validation, and test sets to facilitate robust development and evaluation of ML models. This segmentation is designed with careful attention to avoiding data leakage, enforcing repository-level isolation between the splits.
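Repository-level isolation means every method-test pair from a given repository lands in the same split, so near-duplicate code (forks, copied utilities) cannot leak from training into evaluation. A deterministic hash-based sketch of such a split is shown below; the released dataset ships a precomputed split, so this is only a way to reproduce the principle, not the authors' procedure.

```python
import hashlib

def repo_split(repo_name, train_frac=0.8, valid_frac=0.1):
    """Assign a repository to train/valid/test deterministically, so all
    examples from that repository share one split (no cross-split leakage)."""
    digest = hashlib.sha256(repo_name.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # uniform-ish value in [0, 1)
    if bucket < train_frac:
        return "train"
    if bucket < train_frac + valid_frac:
        return "valid"
    return "test"
```

Because the assignment depends only on the repository name, re-running the split on new examples from an already-seen repository always yields the same answer.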

Implications and Potential Future Work

The introduction of Methods2Test is expected to fuel advancements in automated test generation by providing a large-scale, high-quality training resource. It supports the exploration of various ML models beyond those initially targeted by the authors, such as the encoder-decoder architectures that predominate in natural language processing tasks.

The public availability of this data, coupled with its diverse contextual layers, opens avenues for various applications, extending from ML-based test generation to empirical studies in software testing, pattern identification, and possibly even automated bug fixing.

Future developments might involve extending the dataset to other programming languages and testing frameworks, thereby broadening its applicability and potential impact. Additionally, integrating runtime information and code coverage metrics could further enhance its usefulness for assessing and improving test case generation strategies.

Conclusion

In conclusion, "Methods2Test" offers a pivotal resource that stands to significantly impact the field of software testing and automated test generation by addressing a critical gap. This dataset not only elevates the potential of ML applications in software engineering but also enriches the research ecosystem with a robust and versatile tool for exploring novel methods and models in software testing and beyond.