ARG-V Tool: Realistic Java Benchmarks
- ARG-V Tool is a benchmark generator that extracts and transforms authentic Java code from GitHub into SV-COMP-ready verification tasks.
- It uses a modular three-stage pipeline (Download, Filter, Transform) to maintain real-world control flow and formal verification properties.
- Experimental evaluations show that ARG-V benchmarks expose verification tool limitations by lowering metrics like accuracy and recall compared to traditional suites.
ARG-V Tool
ARG-V—Adaptable Realistic Benchmark Generator for Verification—is a tool designed to automate the generation of realistic Java verification benchmarks in the SV-COMP (Software Verification Competition) format. Its principal function is to mine, filter, and transform code derived from real-world GitHub repositories to yield SV-COMP-ready verification tasks that maintain complex control flow and authentic computational logic, deliberately avoiding synthetic artifacts. The name ARG-V has also appeared in other domains (such as the ARG virtual argumentation tool for legal theory (Silva et al., 2015)), but in current verification research it refers to the benchmark generation system (Moloney et al., 4 Feb 2026).
1. System Architecture and Pipeline
ARG-V implements a modular architecture consisting of three sequential processing stages: Download, Filter, and Transform. The input is a CSV file listing GitHub repositories, which is processed as follows:
- Download Module: Clones or fetches repositories based on a curated list (typically from RepoReaper).
- Filter Module: Parses Java source files using Eclipse JDT's AST, quantifies program features (such as if-statements, loop constructs, branching on primitive types), and selects only those files meeting predefined filter criteria.
- Transform Module: Rewrites the filtered ASTs by removing non-JDK dependencies, renaming packages/classes to conform to SV-COMP standards, injecting verification harnesses (including requisite Verifier API calls), and finally emitting both the Java benchmark file and its YAML configuration.
A schematic of this pipeline is:
| Module | Input | Output |
|---|---|---|
| Download | GitHub repo CSV | Local raw Java files |
| Filter | Raw Java files | “Interesting” Java files |
| Transform | Filtered ASTs | SV-COMP .java + config .yml |
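The three stages above compose naturally as a pipeline of functions. The sketch below is a hypothetical simplification: the `Repo`/`SourceFile` types, the stub download function, and the string-based filter are illustrative stand-ins for the tool's actual CSV input, Git cloning, and JDT-based analysis.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch of ARG-V's Download -> Filter -> Transform pipeline;
// module and type names are illustrative, not the tool's actual API.
public class PipelineSketch {
    // A repository entry from the input CSV (simplified to a name here).
    record Repo(String name) {}
    // A raw or transformed Java source file.
    record SourceFile(String path, String code) {}

    static List<SourceFile> runPipeline(List<Repo> repos,
                                        Function<Repo, List<SourceFile>> download,
                                        Predicate<SourceFile> filter,
                                        Function<SourceFile, SourceFile> transform) {
        return repos.stream()
                .flatMap(r -> download.apply(r).stream())   // Download stage
                .filter(filter)                             // Filter stage
                .map(transform)                             // Transform stage
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stub download yields two files; the filter keeps only the one
        // containing an if-statement; the transform renames its package path.
        List<SourceFile> out = runPipeline(
                List.of(new Repo("example/repo")),
                r -> List.of(new SourceFile("A.java", "if (x > 0) { y = 1; }"),
                             new SourceFile("B.java", "y = 1;")),
                f -> f.code().contains("if ("),
                f -> new SourceFile("svcomp/" + f.path(), f.code()));
        System.out.println(out.size() + " " + out.get(0).path());
    }
}
```

Modeling each stage as a `Function` or `Predicate` mirrors the modularity claim: any stage can be swapped (e.g. a different filter configuration) without touching the others.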
The pipeline is formalized as a function $\mathrm{ARGV}(R) = \{(B_i, \varphi_i)\}$, mapping a set of repositories $R$ to benchmark–property pairs, where each $B_i$ is a benchmark and each $\varphi_i$ is the property under verification (ReachSafety or ExceptionProperty) (Moloney et al., 4 Feb 2026).
2. Formalization and Program Manipulation
For each input, ARG-V models programs as labeled transition systems $M = (S, s_0, \rightarrow)$, with $S$ as the state space, $s_0 \in S$ the entry state, and $\rightarrow \subseteq S \times S$ representing Java bytecode semantics. Verification properties are preserved or injected as:
- ReachSafety: no assertion fails on any execution path.
- ExceptionProperty: no runtime exception is ever thrown.
Verification conditions (VCs) are encoded as $\mathrm{VC}(M) \equiv \forall s \in \mathrm{Reach}(M).\ s \models \phi$, where $\phi$ is the assertion condition inserted during transformation. This process ensures that the benchmarks' safety properties are formally and transparently enforced for evaluation by verification tools (Moloney et al., 4 Feb 2026).
3. Benchmark Generation Criteria and Realism
To ensure realism, ARG-V benchmarks must:
- Originate from non-synthetic, authentic GitHub code (no trivial toy examples).
- Meet minimum complexity thresholds: at least one if-statement, control flow involving primitive-typed branches, and no use of unsupported libraries.
- Retain the average lines of code and computational complexity observed in SV-COMP’s existing Java suite (mean LOC: 41.7 vs. 42.2).
- Exclude deep recursion and exotic library dependencies via filter configuration and manual post-processing.
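The real Filter module evaluates these criteria on Eclipse JDT ASTs; the sketch below approximates two of them (presence of branching, absence of non-JDK dependencies) with token-level regular expressions. The patterns and the threshold of one if-statement are taken from the criteria above; everything else is a simplification.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Token-level approximation of ARG-V's filter criteria (the actual tool
// inspects Eclipse JDT ASTs rather than raw text).
public class BenchmarkFilter {
    private static final Pattern IF_STMT = Pattern.compile("\\bif\\s*\\(");
    // Any import outside java.* / javax.* counts as an unsupported dependency.
    private static final Pattern NON_JDK_IMPORT =
            Pattern.compile("^\\s*import\\s+(?!java\\.|javax\\.)", Pattern.MULTILINE);

    static int count(Pattern p, String src) {
        Matcher m = p.matcher(src);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    /** Keep a file only if it branches and has no unsupported dependencies. */
    static boolean isInteresting(String src) {
        return count(IF_STMT, src) >= 1 && count(NON_JDK_IMPORT, src) == 0;
    }
}
```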
This methodology precludes black-box or ML-based scoring for deduplication, favoring deterministic, filter-driven feature selection. Such constraints ensure that ARG-V benchmarks reflect the verification challenges encountered in real-world codebases (Moloney et al., 4 Feb 2026).
4. Integration with SV-COMP and External Tools
ARG-V integrates with established verification and benchmarking platforms:
- Parsing: AST construction via Eclipse JDT, with type resolution for all `java.*` and `javax.*` classes.
- Repository Selection: Source list typically drawn from the RepoReaper project.
- Benchmark Harnessing: Transformed files are wrapped with a main method, property checks (via Verifier API), and SV-COMP-compliant configuration.
- Evaluation: BenchExec is employed to measure tool performance, enabling parallel and resource-constrained execution across multiple verifiers (JBMC, MLB, JavaRanger, and GDart).
The output bundle for each benchmark includes the Java source and its YAML config specifying property, expected verdict, and metadata (originating repo, LOC, feature counts) (Moloney et al., 4 Feb 2026).
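An output YAML config of this kind might look as follows. This is a hypothetical example following SV-COMP's task-definition convention; the file names, property path, and verdict are illustrative, not taken from the ARG-V corpus.

```yaml
# Hypothetical task definition emitted by the Transform module.
format_version: '2.0'
input_files: 'Main.java'
properties:
  - property_file: ../properties/assert_java.prp
    expected_verdict: true
options:
  language: Java
```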
5. Experimental Evaluation and Impact
ARG-V generated a corpus of 68 novel Java benchmarks carrying 98 property checks in total (48 ReachSafety and 50 ExceptionProperty tasks; some benchmarks carry both properties). In controlled experiments comparing performance on these against the existing SV-COMP suite, four leading Java verifiers (JBMC, MLB, JavaRanger, GDart) showed a decline in accuracy and recall:
| Metric | Existing SV-COMP | ARG-V Benchmarks |
|---|---|---|
| Accuracy | 0.83 | 0.72 |
| Precision | 0.99 | 0.90 |
| Recall | 0.75 | 0.64 |
| Specificity | 0.99 | 0.88 |
Notably, recall on the new benchmarks dropped sharply (from 1.00 to 0.49 for ReachSafety), and the proportion of “unknown” answers (timeouts or inconclusive results) exceeded 50% for some tools. This indicates that ARG-V-generated programs expose weaknesses, or “blind spots,” in current verification technology, despite being comparable in size and complexity to the traditional SV-COMP corpus (Moloney et al., 4 Feb 2026).
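The four metrics in the table follow the standard confusion-matrix definitions over verifier verdicts. The helper below makes those formulas explicit; the verdict counts in `main` are invented purely to demonstrate the computation and are not the paper's raw data.

```java
// Standard confusion-matrix metrics as used in the evaluation; the counts
// in main are illustrative only, not drawn from the ARG-V experiments.
public class Metrics {
    static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }
    static double precision(int tp, int fp)   { return (double) tp / (tp + fp); }
    static double recall(int tp, int fn)      { return (double) tp / (tp + fn); }
    static double specificity(int tn, int fp) { return (double) tn / (tn + fp); }

    public static void main(String[] args) {
        // Hypothetical verdict counts ("true" = property holds);
        // "unknown" answers are excluded from these four metrics.
        int tp = 32, tn = 12, fp = 1, fn = 5;
        System.out.printf("acc=%.2f prec=%.2f rec=%.2f spec=%.2f%n",
                accuracy(tp, tn, fp, fn), precision(tp, fp),
                recall(tp, fn), specificity(tn, fp));
    }
}
```

Note that a large share of "unknown" answers shrinks the counts these ratios are computed over, which is why high precision can coexist with low recall on the ARG-V tasks.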
6. Extensibility and Future Work
Planned extensions for ARG-V include:
- More granular filter options (e.g., minimal loop nesting, maximum call depth thresholds).
- Enhanced support for additional Java library classes and complex data structures.
- Extension to C/C++ via an “argc-transformer.”
- ML-powered deduplication to prevent overlapping with benchmarks already in use.
Verifier toolchains can leverage ARG-V’s filters to synthesize targeted challenge sets, such as those emphasizing floating-point control flow or intricate loop patterns, facilitating focused stress testing and verifier development (Moloney et al., 4 Feb 2026).
7. Distinction from Other ARG-V Systems
ARG-V as discussed above is distinct from earlier systems such as the ARG virtual argumentation training platform for law (sometimes also called ARG-V), which implements Toulmin’s model of argumentation for juridical reasoning and classroom instruction (Silva et al., 2015). In the context of program verification, ARG-V universally refers to the benchmark generation system described in (Moloney et al., 4 Feb 2026).