
SV-COMP Software Verification Competition

Updated 6 February 2026
  • SV-COMP is an annual, community-driven competition that benchmarks fully automatic software verification tools using a large, standardized suite of tasks.
  • It employs strict resource limits and scoring rules to uniformly evaluate verifiers across diverse properties like safety, termination, and concurrency.
  • The competition drives innovation and reproducibility in software verification through transparent artifact sharing, standardized output, and rigorous validation protocols.

The Software Verification Competition (SV-COMP) is an annual, community-driven benchmarking event that establishes the empirical standard for evaluating and comparing fully automatic software verification tools. SV-COMP provides a large-scale, standardized suite of verification tasks across multiple programming languages (notably C and Java) and properties (including safety, termination, memory safety, and concurrency). Tools are evaluated under uniform resource constraints and strictly defined protocols, with scoring and ranking systems designed to reward correct answers and penalize unsound or incomplete verification outcomes. The design and evolution of the SV-COMP ecosystem have significantly influenced both verification research methodology and the development of state-of-the-art automated verification tools (Moloney et al., 4 Feb 2026, Gerhold et al., 2023, Cordeiro et al., 2018, Dacík et al., 27 Feb 2025, Sultan et al., 26 Jan 2026).

1. Objectives and Historical Foundations

SV-COMP’s inception addressed the need for a reproducible and transparent framework for the empirical assessment of software verification technology. The competition has been held annually since 2012, originally focused on C programs and expanded to Java and other languages in subsequent editions. Its stated aims are to:

  • Measure the practical capabilities of automatic software verification tools across a broad spectrum of crafted and real-world code.
  • Drive progress by establishing a common corpus of benchmarks and imposing uniform execution and reporting standards.
  • Ensure comparability and fairness via resource isolation, automated result aggregation, and well-defined scoring semantics.

The SV-COMP benchmark suite spans thousands of verification tasks per category (e.g., 24,391 tasks in SV-COMP 2023, including 23,805 in C and 586 in Java (Gerhold et al., 2023)). Properties checked include safety (absence of assertion violations), termination, concurrency (absence of data races), arithmetic overflow, and system-specific contracts. The Java category was introduced in 2019 to address the lack of a standardized comparison framework for research tools in the Java ecosystem (Cordeiro et al., 2018).

2. Competition Structure and Workflow

Each SV-COMP track is defined by:

  • Categories, each corresponding to a specific property or class of properties (e.g., ReachSafety, RuntimeException, Termination, MemSafety, ConcurrencySafety, NoDataRace).
  • Benchmarks, consisting of a program (source files) and a property specification (typically in .prp format). Each task defines a property such as “no assertion fails” (reachability), “no runtime exception,” or “program terminates.”
  • Tool Execution, orchestrated by the BenchExec benchmarking framework, which standardizes resource limits (CPU time, memory), scheduling, and result isolation. For example, SV-COMP 2023 used a soft timeout of 900 s, hard timeout ≈960 s, and 15 GB RAM per run (Gerhold et al., 2023).
  • Standardized Output: Tools must produce verdicts (such as TRUE, FALSE, or UNKNOWN) and, where required, witnesses in a specified format.
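For concreteness, a .prp file contains a single CHECK clause combining an entry point with an LTL-style property. The two examples below show the standard reachability and termination properties in the form commonly published with the sv-benchmarks repository (reproduced here for illustration; consult the official repository for the authoritative files):

```text
CHECK( init(main()), LTL(G ! call(reach_error())) )
```

This reachability property reads: starting from main, it is globally true that reach_error is never called. The termination property instead requires that every execution eventually reaches the end of the program:

```text
CHECK( init(main()), LTL(F end) )
```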

The artifact workflow incorporates versioned infrastructure, scripts for invocation and result aggregation, validation of verdicts (correct, incorrect, unconfirmed), and aggregation for rankings. The official repository hosts all benchmarks and property specifications, as well as tool integration modules and results. Each tool submission undergoes smoke testing and compliance checks prior to the official execution (Cordeiro et al., 2018, Gerhold et al., 2023).
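BenchExec runs are driven by an XML benchmark definition that fixes the resource limits and task sets. The sketch below uses the SV-COMP 2023 limits quoted above; the tool name, set file, and paths are placeholders, and attribute names should be checked against the BenchExec documentation:

```xml
<?xml version="1.0"?>
<!-- Hedged sketch of a BenchExec benchmark definition; paths are illustrative. -->
<benchmark tool="cpachecker" timelimit="900 s" hardtimelimit="960 s"
           memlimit="15 GB" cpuCores="4">
  <rundefinition name="SV-COMP23_unreach-call">
    <tasks name="ReachSafety-Loops">
      <includesfile>../sv-benchmarks/c/ReachSafety-Loops.set</includesfile>
      <propertyfile>../sv-benchmarks/c/properties/unreach-call.prp</propertyfile>
    </tasks>
  </rundefinition>
</benchmark>
```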

3. Scoring, Evaluation Metrics, and Result Validation

SV-COMP employs property- and track-specific scoring rules to rank tools. Scoring is asymmetric to penalize unsound answers more heavily.

Example: Java (ReachSafety, SV-COMP 2019–2026)

| Verdict | Points | Comment |
|---|---|---|
| TRUE (correct) | +2 | Safety proved |
| FALSE (correct) | +1 | Bug found |
| UNKNOWN | 0 | Timeout, error, or inconclusive |
| FALSE (incorrect) | −16 | False alarm on a safe program |
| TRUE (incorrect) | −32 | Missed a real bug |
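The asymmetric scheme above can be expressed as a small scoring function. This is a sketch for illustration; the verdict strings and data layout are our own, not SV-COMP's machine format:

```python
# Hedged sketch of SV-COMP-style asymmetric scoring (Java ReachSafety schema).
# Keys are (verdict, was_the_verdict_correct); values are points.
SCORES = {
    ("true", True): 2,     # correct safety proof
    ("false", True): 1,    # correct bug report
    ("true", False): -32,  # wrong proof: a real bug was missed
    ("false", False): -16, # false alarm on a safe program
}

def score(verdict, correct):
    """Points for one task; UNKNOWN (correct is None) scores 0."""
    if verdict == "unknown" or correct is None:
        return 0
    return SCORES[(verdict, correct)]

def total(results):
    """Sum the points over an iterable of (verdict, correct) pairs."""
    return sum(score(v, c) for v, c in results)
```

The heavy negative weights make a single unsound answer erase many correct ones, which is exactly the incentive the competition intends.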

For the C Termination category in SV-COMP 2025, scoring is similarly asymmetric for correct termination proofs and non-termination counterexamples, and non-termination (NT) verdicts that lack valid witnesses receive zero credit (Sultan et al., 26 Jan 2026). Metrics such as accuracy, recall, precision, specificity, and undecidable rate are systematically reported and compared, both excluding and including “unknown/error/timeout” outcomes as failures. Reproduction studies confirm that rankings are robust to minor runtime and memory fluctuations, although more fine-grained documentation is advocated for full reproducibility (Gerhold et al., 2023).
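The reported metrics follow from standard confusion-matrix counts. The sketch below uses one plausible mapping (bug-finding as the positive class); the exact class conventions in individual papers may differ:

```python
# Hedged sketch of the evaluation metrics reported around SV-COMP results.
# tp: correct FALSE (bug found), tn: correct TRUE (safety proved),
# fp: false alarm, fn: missed bug, unknown: timeout/error/inconclusive.
def metrics(tp, tn, fp, fn, unknown):
    decided = tp + tn + fp + fn
    total = decided + unknown
    return {
        "accuracy": (tp + tn) / decided,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "undecidable_rate": unknown / total,
        # variant that counts unknown/error/timeout as failures:
        "accuracy_incl_unknown": (tp + tn) / total,
    }
```

Reporting both accuracy variants matters because a tool that answers rarely but always correctly looks very different under the two conventions.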

4. Benchmark Suite Composition and Impact

SV-COMP’s benchmark composition exerts a direct influence on tool development and evaluation outcomes. An over-representation of certain programming idioms or under-tested language features can drive inadvertent overfitting, while a more diverse suite encourages broadly applicable verifier innovation (Moloney et al., 4 Feb 2026). The Java track, for instance, contains hundreds of tasks partitioned by property (ReachSafety, ExceptionProperty, etc.) (Moloney et al., 4 Feb 2026, Cordeiro et al., 2018).

Efforts to expand realism and diversity include automatic benchmark generation (notably with ARG-V), which mines, filters, and transforms open-source Java code. ARG-V’s formal model selects files based on syntactic criteria (e.g., minimum number of if statements and the presence of primitive conditions), strips external dependencies, injects controlled nondeterminism, and converts results to SV-COMP’s input format. Empirically, the addition of new ARG-V–generated benchmarks leads to a notable drop in recall and increased undecidable verdicts across leading Java verifiers—demonstrating potential overfitting to the established suite and exposing previously under-tested behaviors (e.g., inter-procedural flow, floating-point conditions) (Moloney et al., 4 Feb 2026).
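The kind of syntactic selection criteria described for ARG-V can be sketched as a simple source filter. The thresholds and regexes below are illustrative assumptions, not ARG-V's actual model:

```python
import re

# Hedged sketch of an ARG-V-style syntactic filter: keep Java sources with
# enough branching and at least one condition over primitive values.
# Thresholds and patterns are illustrative placeholders.
IF_PATTERN = re.compile(r"\bif\s*\(")
PRIMITIVE_COND = re.compile(r"\bif\s*\([^)]*[<>=!]=?[^)]*\)")

def is_candidate(source, min_ifs=3):
    """Return True if the file meets the minimum-branching heuristic."""
    ifs = len(IF_PATTERN.findall(source))
    has_primitive_condition = PRIMITIVE_COND.search(source) is not None
    return ifs >= min_ifs and has_primitive_condition
```

In the actual pipeline, files passing such a filter would then be dependency-stripped, instrumented with controlled nondeterminism, and converted to SV-COMP's task format.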

5. Representative Tools, Categories, and Experimental Results

SV-COMP rigorously evaluates diverse classes of verifiers:

  • Java Verifiers: MLB (ML-steered symbolic execution), GDart (dynamic symbolic execution ensemble), JavaRanger (region summarization), JBMC (bounded model checking for Java bytecode) (Moloney et al., 4 Feb 2026). Results consistently show that all top verifiers experience marked performance degradation on new, realistic benchmarks (e.g., cumulative recall falling from 0.75 to 0.64; accuracy from 0.83 to 0.72; and undecidable rates doubling for challenging programs).
  • C Tools (e.g., Data Race Detection): RacerF (Frama-C plugin), which applies abstract interpretation and dual-mode (under- and over-approximation) for thread race detection, achieved second place in NoDataRace-Main, handling all tested Linux kernel tasks and providing the largest number of correct answers among non-metaverifiers (Dacík et al., 27 Feb 2025).
  • LLM-based Analysis: Recent work has evaluated GPT-5, Claude Sonnet-4.5, and CWM on C termination tasks, finding that, under consensus-based test-time scaling, these models approach the performance of top traditional tools—albeit with significant limitations in witness production and degradation as code length increases (Sultan et al., 26 Jan 2026).
Representative results on existing versus new ARG-V-generated benchmarks (Moloney et al., 4 Feb 2026):

| Category | Metric | Existing | New (ARG-V) |
|---|---|---|---|
| ReachSafety | Accuracy | 1.00 | 0.72 |
| ReachSafety | Recall | 1.00 | 0.49 |
| ReachSafety | Undecidable | 25% | 55% |
| RuntimeException | Accuracy | 0.67 | 0.72 |
| RuntimeException | Recall | 0.66 | 0.73 |
| RuntimeException | Undecidable | 26% | 52% |

A marked increase in undecidable verdicts, particularly for new realistic benchmarks, highlights the need for ongoing benchmark suite expansion and suggests that leading verifiers have not yet fully generalized to more realistic code distributions.
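The consensus-based test-time scaling used in the LLM evaluation above can be sketched as majority voting over independent samples. The function below is a hedged illustration of the idea, not the protocol from the cited paper:

```python
from collections import Counter

def consensus(verdicts, threshold=0.5):
    """Majority vote over repeated model samples for one task.

    Returns the dominant verdict if it exceeds the threshold fraction of
    samples, otherwise "unknown". Verdict strings are illustrative.
    """
    counts = Counter(verdicts)
    verdict, n = counts.most_common(1)[0]
    if n / len(verdicts) > threshold:
        return verdict
    return "unknown"
```

Aggregating several samples trades extra inference cost for fewer confidently wrong answers, which matters under SV-COMP's asymmetric penalties.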

6. Infrastructure, Reproducibility, and Community Practices

SV-COMP employs substantial technical infrastructure to ensure experimental rigor, reproducibility, and transparency:

  • BenchExec Framework: Provides process isolation, resource accounting, and standardized result collection (Cordeiro et al., 2018).
  • Artifact Sharing: All task definitions, tool modules, logs, and final reports are published, permitting independent reproduction of experiments (Gerhold et al., 2023).
  • Result Validation: Witnesses and verdicts are subjected to further automated validation; discrepancies are penalized by scoring rules.
  • Reproducibility Studies: Independent re-executions of competition artifacts have confirmed that overall rankings remain stable under minor environmental variation, with recommendations for improved user documentation and containerized environments for future editions (Gerhold et al., 2023).
  • Community Engagement: The competition process includes training periods, public benchmarks, and mechanisms for participant feedback and infrastructure contributions (Cordeiro et al., 2018).

7. Implications, Limitations, and Future Directions

SV-COMP’s central position in the verification community ensures that it both reflects and shapes contemporary methodological standards.

  • Benchmark Evolution: Ongoing integration of automatically generated, realistic benchmarks (e.g., via ARG-V) is advocated to counteract overfitting and reveal new verifier limitations (Moloney et al., 4 Feb 2026).
  • Property Expansion: New tracks, such as advanced concurrency properties, data race freedom, and quantitative resource bounds, are under active consideration (Dacík et al., 27 Feb 2025, Cordeiro et al., 2018).
  • LLM and Neuro-Symbolic Integration: Recent results highlight the potential and challenges of integrating LLM-based reasoning with classical verification tools, particularly regarding the generation of formal witnesses, and underscore sensitivity to task complexity and prompt design (Sultan et al., 26 Jan 2026).
  • Reproducibility and Artifact Sharing: Improved end-to-end documentation, automated environment provisioning, and regular independent audit runs are recommended to further enhance reproducibility (Gerhold et al., 2023).
  • Real-World Applicability: Systematic expansion to cover more industrial and open-source software is necessary to ensure that tool improvements generalize beyond academic exemplars (Moloney et al., 4 Feb 2026).

Initiatives such as ARG-V, as well as quantitative performance assessments across evolving benchmarks and properties, position SV-COMP as an essential driver for incremental and sustained improvement in the field of program verification.