Classes2Test Dataset

Updated 2 December 2025

Classes2Test is an annotated corpus that maps Java classes under test to their corresponding human-written JUnit test classes, for class-level evaluation of unit test generation.
The dataset uses project filtering, version pinning, and AST analysis to make class–test pair extraction reproducible.
It serves as the ground truth for the AgoneTest framework, which reports code coverage, mutation scores, and test smells.

Classes2Test is an open-source, annotated corpus designed for comprehensive evaluation of unit test generation at the class level in Java software projects. The dataset maps Java classes under test to their corresponding human-written JUnit test classes and serves as a standardized ground truth for benchmarking automated systems—especially LLM-driven test generation workflows—under realistic, end-to-end development conditions. Classes2Test is central to the AgoneTest framework, which provides a suite of advanced evaluation metrics for both human and LLM-generated tests (Lops et al., 25 Nov 2025).

1. Dataset Construction and Methodological Pipeline

Classes2Test is built on the Methods2Test corpus (Tufano et al. 2022), which mined 91,385 GitHub repositories to associate focal methods with their corresponding test methods. Classes2Test applies the following filtering and extraction steps to Methods2Test:

Project Selection: Projects retained for Classes2Test must possess at least one method-test mapping, compile successfully with all dependencies resolved, not be forks/duplicates, and show activity within the last five years. This filtration yields 9,410 active, compilable Java repositories—approximately 10% of the initial corpus.
Version Pinning: For each repository, Classes2Test records the GitHub URL, default branch, and the exact commit hash used for extraction. Pinning the commit fixes the Java version, dependencies, and project structure, so the extraction can be reproduced.
Class-Test Pair Identification: Discovery proceeds in three stages:
1. Naming Conventions: Test classes are harvested from src/test/java matching common patterns (e.g., MyClassTest, TestMyClass).
2. Static AST Analysis: Java Abstract Syntax Tree (AST) parsing confirms structural evidence, such as imports referencing the class under test, constructor calls, and Mockito stubs.
3. Evidence-Ratio Filtering: Where a test class references multiple classes under test, a reference ratio is computed. The mapping is retained only if one class under test accounts for ≥ 60% of all references; mappings below the threshold or otherwise ambiguous are excluded (< 0.002% of candidate pairs).
Normalization and Annotation: Duplicates and measurement artifacts are removed, and each class–test pair is annotated with metadata (commit hash, Java/JUnit version, LOC, cyclomatic complexity).
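The evidence-ratio stage can be illustrated with a minimal sketch. It assumes that every reference counts equally and that the ratio is the dominant class's share of all references; the source does not spell out the exact weighting, so the function name `dominant_class` and the input format are hypothetical.

```python
from collections import Counter

THRESHOLD = 0.60  # from the text: one class must account for >= 60% of references


def dominant_class(references, threshold=THRESHOLD):
    """Return the class under test that dominates the references, or None.

    `references` holds one class name per reference found in the test class
    (imports, constructor calls, stubs). Equal weighting is an assumption.
    """
    if not references:
        return None
    name, hits = Counter(references).most_common(1)[0]
    ratio = hits / len(references)
    return name if ratio >= threshold else None


# 4 of 5 references point at KeyManager (ratio 0.8), so the mapping is kept.
print(dominant_class(["KeyManager"] * 4 + ["KeyStore"]))  # KeyManager
# An even split (ratio 0.5) is ambiguous, so the mapping is dropped.
print(dominant_class(["KeyManager", "KeyStore"]))  # None
```

A test class whose references are split evenly between two classes yields no mapping, which matches the exclusion rule described above.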

2. Dataset Properties and Structural Features

Classes2Test offers substantial scale and granularity:

Repositories: 9,410 unique Java projects.
Test Classes and Pairs: 147,473 mapped test classes, yielding roughly 147,000 class–test pairs.
Code Metrics: Average lines of code (LOC) for classes under test is 1,178 (interquartile range, IQR: 420–1,960); cyclomatic complexity averages 55.3 (IQR: 18–92).
Ecosystem Composition: Distribution of frameworks and Java versions:
- JUnit 4 (55%), JUnit 5 (41%), other (TestNG, etc.) (4%)
- Java 11 (42%), Java 17 (25%), Java 21+ (18%), Java 8 (14%)

A representative example includes a class KeyManager.java under src/main/java/com/example, paired with KeyManagerTest.java in src/test/java/com/example, where the test exercises core methods using JUnit assertions.
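The naming-convention stage, applied to the example above, might look like the following sketch. It handles only the two patterns named in the text (`MyClassTest`, `TestMyClass`) and assumes the test tree mirrors the package layout of the main tree; the helper names are hypothetical, and the actual pipeline may recognize more patterns.

```python
from pathlib import PurePosixPath


def class_name_from_test(test_path):
    """Strip a 'Test' suffix or prefix from the test file name, else None."""
    stem = PurePosixPath(test_path).stem
    if stem.endswith("Test") and len(stem) > len("Test"):
        return stem[: -len("Test")]
    if stem.startswith("Test") and len(stem) > len("Test"):
        return stem[len("Test"):]
    return None


def expected_class_path(test_path):
    """Guess the path of the class under test (assumes mirrored packages)."""
    name = class_name_from_test(test_path)
    if name is None:
        return None
    parent = PurePosixPath(test_path).parent.as_posix()
    parent = parent.replace("src/test/java", "src/main/java", 1)
    return f"{parent}/{name}.java"


print(expected_class_path("src/test/java/com/example/KeyManagerTest.java"))
# src/main/java/com/example/KeyManager.java
```

A name match alone is only a candidate; in the pipeline described earlier, AST evidence and the reference ratio decide whether the pair is kept.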

3. Data Schema, File Organization, and Metadata

Repositories are organized as top-level folders named after their GitHub owner and repository, containing:

src/main/java/…: source classes
src/test/java/…: test classes
classes2test-manifest.csv: index file with class–test mappings

The manifest CSV has the following columns:

repo_url: GitHub clone URL
commit_hash: pinned commit SHA
class_path: relative path to class under test
test_class_path: relative path to test class
java_version: project LTS version
test_framework: JUnit4/5/other
loc: lines of code for class under test
cyclo: cyclomatic complexity

repo_url	commit_hash	class_path	test_class_path	java_version	test_framework	loc	cyclo
https://github.com/foo/bar	f5e3a17	src/main/java/com/foo/Bar.java	src/test/java/com/foo/BarTest.java	11	JUnit5	530	24
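As a rough illustration, the manifest can be loaded with Python's standard `csv` module. The sample row above is tab-separated even though the file is described as CSV, so the delimiter is left as a parameter here; that choice, and the function name `load_manifest`, are assumptions rather than part of the dataset's tooling.

```python
import csv
import io

# The header and example row from the text, joined with tabs.
SAMPLE = (
    "repo_url\tcommit_hash\tclass_path\ttest_class_path\t"
    "java_version\ttest_framework\tloc\tcyclo\n"
    "https://github.com/foo/bar\tf5e3a17\tsrc/main/java/com/foo/Bar.java\t"
    "src/test/java/com/foo/BarTest.java\t11\tJUnit5\t530\t24\n"
)


def load_manifest(handle, delimiter="\t"):
    """Read manifest rows as dicts, converting the numeric columns to int."""
    rows = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        row["loc"] = int(row["loc"])
        row["cyclo"] = int(row["cyclo"])
        rows.append(row)
    return rows


rows = load_manifest(io.StringIO(SAMPLE))
junit5 = [r for r in rows if r["test_framework"] == "JUnit5"]
print(junit5[0]["class_path"], junit5[0]["loc"])
# src/main/java/com/foo/Bar.java 530
```

For a file on disk, the same function accepts an open file handle, for example `open(path, newline="")`, with `delimiter=","` if the file is comma-separated.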

4. Evaluation Criteria and Metrics

Classes2Test serves as ground truth for AgoneTest’s quantification of test suite performance. The following metrics are reported over any human or LLM-generated test suite executed on the dataset:

Code Coverage (JaCoCo):
- Line coverage (%)
- Method coverage (%)
- Branch coverage (%)
Mutation Score (PiTest):
- $M = \frac{\text{KilledMutants}}{\text{TotalMutants}} \times 100\%$, which measures the suite's ability to detect injected faults.
Test Smells (tsDetect):
- Enumeration of test-code anti-patterns, including Assertion Roulette (AR), Conditional Test Logic (CTL), Eager Test (EA), among others.
Compiled-Only Averages: For a metric value $m_i$ (coverage or test smell count) and a compilation indicator $build_i \in \{0,1\}$, the average is taken only over the entries that compile:

$\overline{m} = \frac{1}{N_{comp}} \sum_{i=1}^N (build_i \times m_i)$

where $N_{comp} = \sum_{i=1}^N build_i$ is the number of entries that compile.
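The two formulas above can be checked with a small worked example. The numbers are made up for illustration and are not results from the dataset; the function names are hypothetical.

```python
def mutation_score(killed, total):
    """M = KilledMutants / TotalMutants * 100, as a percentage."""
    if total <= 0:
        raise ValueError("total mutants must be positive")
    return 100.0 * killed / total


def compiled_only_mean(values, builds):
    """Average of m_i over entries with build_i == 1; None if nothing compiles."""
    n_comp = sum(builds)
    if n_comp == 0:
        return None
    return sum(b * m for b, m in zip(builds, values)) / n_comp


# Illustrative values only: 30 of 40 mutants killed.
print(mutation_score(30, 40))  # 75.0

# Three entries; the second does not compile, so its value is ignored.
line_coverage = [80.0, 55.0, 60.0]
builds = [1, 0, 1]
print(compiled_only_mean(line_coverage, builds))  # 70.0
```

Because non-compiling entries contribute neither to the sum nor to $N_{comp}$, the compiled-only average describes the quality of the tests that build, not how often they build.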

Empirical statistics for human-written tests: branch coverage 48.7%, line coverage 73.2%, mutation score 40.4%, with test smell occurrences detailed in the source (Lops et al., 25 Nov 2025).

5. Benchmarking Workflows and Use Cases

AgoneTest utilizes Classes2Test as the foundation for several experimental protocols:

Project Selection: Ingests classes2test-manifest.csv to choose class–test pairs for analysis.
Automated Setup: Clones repositories at recorded commit hashes, extracting metadata and setting up Java/JUnit environments.
Test Generation: Feeds class code, with optional few-shot exemplars from Classes2Test, to an LLM (e.g., GPT-4, Gemini, Llama) to generate new test classes.
Assessment and Reporting: Executes pipelines (mvn test, gradle test), collects coverage, mutation, and test smell reports, and aggregates results per class and per LLM/prompt, always juxtaposed against the human baseline.
Benchmarking Modes:
- LLM comparison: Evaluates multiple LLMs on shared subsets.
- Prompt engineering ablation: Zero-shot vs few-shot generation using in-dataset exemplars.
- Error analysis and autocompletion: Enhanced prompting incorporating precise class paths for import resolution.

This structure enables robust, reproducible comparison of LLM performance and prompt designs using extensible, real-world software artifacts.

6. Context, Significance, and Implications

Classes2Test advances the rigor of automated unit test generation research by providing a realistic, annotated corpus with reproducible extraction protocols and granular metadata. By aligning benchmarks with practical software engineering environments (real projects, actual dependency graphs, and class/test structure), Classes2Test mitigates confounding factors common to synthetic or isolated benchmarks. Its integration in AgoneTest further standardizes metrics—branch/line/method coverage, mutation scores, and test smells—enabling nuanced assessment of both human and machine-generated unit tests.

A plausible implication is that Classes2Test, used together with AgoneTest, may guide improvements in LLM design, prompt engineering strategies, and industrial best practices for automated software testing. The empirical finding that LLM-generated tests frequently match or outperform human baselines (for compilable cases) illustrates the dataset's utility in driving data-centric research and iterative model evaluation (Lops et al., 25 Nov 2025).

References

Lops et al. (2025). LLMs for Automated Unit Test Generation and Assessment in Java: The AgoneTest Framework.