C Benchmark Overview

Updated 22 June 2026

C benchmarks are curated datasets and protocols for systematically evaluating and validating C code across multiple domains.
They target applications in vulnerability detection, formal verification, automated repair, and code translation with fine-grained annotations and rigorous metrics.
Their construction combines multi-level context extraction, performance metrics such as precision and recall, and detailed analyses of current tool limitations.

A C benchmark is any curated dataset, protocol, or test suite explicitly designed for the systematic evaluation, comparison, or validation of software artifacts or research methodologies involving the C programming language. These benchmarks serve a variety of research and development domains, ranging from security and software verification, to neural network implementation and program repair, to code search, code translation, and real-world workload emulation. The following sections catalog prominent C benchmarks introduced in the literature, clarify their construction principles and empirical use, and summarize the methodological advances and limitations exposed by these resources.

1. C Benchmarks for Vulnerability and Security Evaluation

C benchmarks targeting software vulnerability detection provide ground-truth data and rigorously annotated testbeds for assessing static, dynamic, and AI-based security methods. The SecVulEval benchmark advances the evaluation of real-world C and C++ vulnerability detection by introducing statement-level annotation granularity, rich contextual information, and extensive coverage of realistic weaknesses (Ahmed et al., 26 May 2025).

SecVulEval: 25,440 function samples from 707 C/C++ codebases, spanning 5,867 unique CVEs and 145 CWE categories, with per-statement vulnerability and fix annotations. Each benchmark instance is accompanied by five levels of program context (function arguments, external functions, type declarations, global symbols, execution environment), automatically extracted and validated at 83% accuracy.
Evaluation Protocol: Multi-agent LLM pipelines decompose the task into normalization, planning, context acquisition, detection, and validation stages. Empirical measurement uses precision, recall, and F₁-score at both function and statement level.
Outcomes: Performance of state-of-the-art LLMs is fundamentally limited at statement-level localization (maximum F₁ ≈ 23.8%), especially for complex or long functions, and is sensitive to context extraction quality. Explicit interprocedural and repository-wide context substantially improves practical detection but remains beyond current LLM self-reasoning capabilities.
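
The statement-level metrics named above can be sketched in C as follows. This is a minimal illustration, not SecVulEval's own scoring code: the type prf_scores and the function statement_level_scores are hypothetical names, and the sketch assumes that a prediction is a set of unique line numbers flagged as vulnerable within one function, and that undefined ratios are reported as 0.

```c
#include <stddef.h>

/* Precision, recall, and F1 for one function's statement-level predictions.
 * (Hypothetical type; not part of any benchmark's tooling.) */
typedef struct {
    double precision;
    double recall;
    double f1;
} prf_scores;

/* Returns 1 if `line` occurs in lines[0..n), else 0. */
static int contains_line(const int *lines, size_t n, int line)
{
    for (size_t i = 0; i < n; i++) {
        if (lines[i] == line) return 1;
    }
    return 0;
}

/*
 * Compares predicted vulnerable line numbers against ground-truth ones.
 * Assumption: both arrays hold unique line numbers.
 * Assumption: ratios with a zero denominator are reported as 0.0.
 */
prf_scores statement_level_scores(const int *predicted, size_t n_pred,
                                  const int *truth, size_t n_truth)
{
    size_t tp = 0;  /* predicted lines that are truly vulnerable */
    for (size_t i = 0; i < n_pred; i++) {
        tp += (size_t)contains_line(truth, n_truth, predicted[i]);
    }

    prf_scores s = {0.0, 0.0, 0.0};
    if (n_pred > 0)  s.precision = (double)tp / (double)n_pred;
    if (n_truth > 0) s.recall    = (double)tp / (double)n_truth;
    if (s.precision + s.recall > 0.0) {
        s.f1 = 2.0 * s.precision * s.recall / (s.precision + s.recall);
    }
    return s;
}
```

For example, predicting lines {10, 12, 15, 20} against ground truth {12, 20, 31} gives two true positives, so precision is 2/4, recall is 2/3, and F₁ is 4/7. Function-level scoring would apply the same formulas with one binary label per function instead of one per statement.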

2. Benchmarks for C Code Verification and Certification

Formal software verification in C remains a pivotal domain requiring benchmarks that stress low-level correctness, undefined behavior safety, and specification conformance.

“A benchmark for C program verification” (Eekelen et al., 2019): A 25-program suite covers minimal programs (factorial, cat), real-world library implementations (malloc, sort), and concurrent code. Each program targets distinct aspects (heap safety, pointer arithmetic, floating-point math, standard I/O) using actual C source with no abstraction layers. A transparent 0–4 point score per program rewards tools that prove absence of undefined behavior, prove functional correctness, and verify the unmodified C source rather than a high-level model of it. No baseline results are reported; the suite is explicitly designed for community evaluations and tool competitions.
NeuroCodeBench (Manino et al., 2023): Comprises 32 C programs encoding neural network computations and math routines, totaling 607 SV-COMP safety properties (bounds, monotonicity, symmetry, action-selection). Challenges include dependency on <math.h>, floating-point reasoning, and complex neural architectures. Major verification frameworks achieve highly incomplete or incorrect results, highlighting key bottlenecks in C math library modeling and floating-point support.
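
To make the notion of a bounds or monotonicity property concrete, the sketch below states both for a one-neuron "network". Everything in it is a hypothetical illustration: the weight, bias, input range, and function names are assumptions and are not taken from NeuroCodeBench. A verifier would have to establish these properties for every admissible input, whereas the checker functions here only evaluate them for the concrete inputs they are given.

```c
/* Hypothetical one-neuron "network": y = relu(weight * x + bias).
 * The constants are illustrative and exactly representable in binary. */
#define NEURON_WEIGHT 0.5f
#define NEURON_BIAS   0.25f

static float relu(float v)
{
    return v > 0.0f ? v : 0.0f;
}

float neuron(float x)
{
    return relu(NEURON_WEIGHT * x + NEURON_BIAS);
}

/* Bounds property: for x in [-1, 1], the output must lie in [0, 0.75].
 * Returns 1 if the property holds for this particular x. Inputs outside
 * the assumed domain satisfy the property vacuously. */
int bounds_property_holds(float x)
{
    if (x < -1.0f || x > 1.0f) return 1;
    float y = neuron(x);
    return y >= 0.0f && y <= 0.75f;
}

/* Monotonicity property: x1 <= x2 implies neuron(x1) <= neuron(x2).
 * Returns 1 if the property holds for this particular pair. */
int monotonicity_holds(float x1, float x2)
{
    if (x1 > x2) return 1;  /* precondition not met */
    return neuron(x1) <= neuron(x2);
}
```

Even in this toy case, proving the properties for all inputs requires reasoning about floating-point rounding in the multiply-add, which is the kind of reasoning the benchmark results above identify as a bottleneck once <math.h> functions and larger architectures are involved.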

3. Automated Program Repair, Introductory Education, and Student Code

Benchmarks focused on C program repair and education have addressed semantic correctness and the diversity of student-generated code errors.

C-Pack-IPAs (Orvalho et al., 2022): Contains 1,264 student-submitted programs over 25 C90-based introductory assignments (“labs”), categorized as semantically correct (passes all tests), semantically incorrect, or syntactically incorrect (fails to compile). Each assignment is supplied with instructor reference solutions and exhaustive black-box test suites. The dataset is suited to benchmarking both syntactic and semantic program repair frameworks but lacks a fine-grained error taxonomy.
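
The three-way categorization described above can be written as a small decision function. This is a sketch under one assumption: a submission counts as semantically incorrect when it compiles but fails at least one test. The enum and function names are hypothetical and do not come from the C-Pack-IPAs tooling.

```c
/* The three submission categories described for the dataset. */
typedef enum {
    SYNTACTICALLY_INCORRECT,  /* fails to compile */
    SEMANTICALLY_INCORRECT,   /* assumed: compiles but fails at least one test */
    SEMANTICALLY_CORRECT      /* compiles and passes every test */
} submission_class;

/* Classifies one submission from its compile status and test results. */
submission_class classify_submission(int compiled, int tests_passed,
                                     int tests_total)
{
    if (!compiled) return SYNTACTICALLY_INCORRECT;
    if (tests_passed < tests_total) return SEMANTICALLY_INCORRECT;
    return SEMANTICALLY_CORRECT;
}
```

A repair tool evaluated on such data would typically be judged by whether its output moves a submission from one of the first two classes into the third.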

4. Code Search, Retrieval, and Robustness in C/C++

Retrieval and search benchmarks for C/C++ code have addressed the paucity of domain-appropriate, compilable datasets for evaluating natural language–code retrieval, especially under conditions that stress semantic rather than lexical understanding.

CLARC (C/C++ LAnguage Retrieval with Anonymized Code) (Wang et al., 4 Mar 2026): 6,717 query–code pairs across 144 GitHub repositories, partitioned by dependency complexity (standard library only, custom types, or custom helpers). Each function is validated for successful compilation, and queries are produced by SOTA LLMs and validated via human hypothesis testing. Robustness stress tests replace or randomize identifiers and lower code to x86 Assembly and WebAssembly. All competitive retrieval models (Voyage, Nomic, OASIS, OpenAI embedding) exhibit significant performance drops under anonymization and compilation, highlighting deep reliance on lexical cues.
Empirical findings: Even after fine-tuning, embedding-based methods are unable to abstract away from identifier tokens or cope with code lowered to non-source representations.
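
The identifier-replacement stress test can be illustrated with a deliberately simplified renamer. This is not CLARC's anonymization procedure, which is not detailed here; rename_identifier is a hypothetical helper that replaces whole-identifier occurrences of one name with a placeholder, and it assumes that string literals, comments, and preprocessor lines need no special treatment.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

static int is_ident_char(unsigned char c)  { return isalnum(c) || c == '_'; }
static int is_ident_start(unsigned char c) { return isalpha(c) || c == '_'; }

/*
 * Copies `src` into `dst`, replacing every whole-identifier occurrence of
 * `name` with `placeholder`. Returns 0 on success, -1 if `dst` is too small.
 * Simplification: literals, comments, and preprocessor lines are scanned
 * like ordinary code.
 */
int rename_identifier(const char *src, const char *name,
                      const char *placeholder, char *dst, size_t dst_size)
{
    size_t name_len = strlen(name);
    size_t out = 0;
    size_t i = 0;

    while (src[i] != '\0') {
        /* Current token: a run of identifier characters, or one other char. */
        size_t j = i + 1;
        if (is_ident_char((unsigned char)src[i])) {
            while (is_ident_char((unsigned char)src[j])) j++;
        }
        const char *piece = src + i;
        size_t piece_len = j - i;

        /* Only runs that start like an identifier (not a digit) can match. */
        if (is_ident_start((unsigned char)src[i]) && piece_len == name_len &&
            memcmp(piece, name, name_len) == 0) {
            piece = placeholder;
            piece_len = strlen(placeholder);
        }

        if (out + piece_len >= dst_size) return -1;  /* keep room for '\0' */
        memcpy(dst + out, piece, piece_len);
        out += piece_len;
        i = j;
    }

    if (out >= dst_size) return -1;
    dst[out] = '\0';
    return 0;
}
```

Renaming count to v0 in "int n = count + count2;" leaves count2 untouched because the match is on whole identifiers. The program's behavior is unchanged by such renaming, so any drop in retrieval quality on the renamed code points to reliance on the original names rather than on semantics.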

5. Domain-Specific and Cross-Platform C Benchmarks

Other benchmarks address specialized domains or cross-language migration scenarios:

CRUST-Bench (Khatry et al., 21 Apr 2025): A comprehensive C-to-safe-Rust transpilation benchmark, comprising 100 open-source C repositories, each paired with Rust interfaces and Rust-native test suites. Evaluations reveal that high-quality, idiomatic Rust code generation remains an open challenge: SOTA LLMs solve only 15% of tasks in single-shot mode and up to 37% with self-repair. Error analysis underscores persistent failures in function signature matching, borrow/lifetime correctness, and multi-file coherence.
TPC-C for Blockchain Environments (Klenik et al., 2021): Maps the TPC-C transactional workload to chaincode for Hyperledger Fabric, with a Fabric+Caliper test harness for measuring classical OLTP metrics (throughput, latency, error ratio). The adaptation reveals that MVCC (multi-version concurrency control) conflicts form the primary scaling bottleneck of the blockchain-based implementation.

6. Construction Principles, Metrics, and Limitations

C benchmarks define not only ground-truth data but also evaluation protocols, performance metrics, and detailed error/bias analysis.

Labeling and Granularity: High-quality benchmarks emphasize fine granularity (statement, function, or property-level), covering extensive CWE codes (SecVulEval), property types (NeuroCodeBench), or test suite–based correctness (C-Pack-IPAs).
Evaluation Frameworks: Metrics include precision, recall, and F₁ at multiple granularities (SecVulEval); mean reciprocal rank and recall@k (CLARC); test-based pass/fail outcomes (C-Pack-IPAs, CRUST-Bench); binary classification accuracy for AI-detection benchmarks such as C-ReD (Qing et al., 13 Apr 2026); and property success counts (NeuroCodeBench).
Known Limitations: Common constraints are incomplete context extraction (limiting LLM performance), class imbalances in educational datasets, missing formal correctness metrics beyond test passing, and verification tool limitations with math libraries and floating-point semantics.
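
The retrieval metrics listed above have short standard definitions, sketched here in C. The function names are hypothetical, and the sketch assumes exactly one correct code snippet per query, encoded as its 1-based rank in the result list (0 when it was not retrieved).

```c
#include <stddef.h>

/*
 * ranks[i] is the 1-based rank at which the correct snippet was returned
 * for query i, or 0 if it was not retrieved at all.
 * Assumption: each query has exactly one correct snippet.
 */
double mean_reciprocal_rank(const int *ranks, size_t n_queries)
{
    if (n_queries == 0) return 0.0;
    double sum = 0.0;
    for (size_t i = 0; i < n_queries; i++) {
        if (ranks[i] > 0) sum += 1.0 / (double)ranks[i];
    }
    return sum / (double)n_queries;
}

/* Fraction of queries whose correct snippet appears in the top k results. */
double recall_at_k(const int *ranks, size_t n_queries, int k)
{
    if (n_queries == 0) return 0.0;
    size_t hits = 0;
    for (size_t i = 0; i < n_queries; i++) {
        if (ranks[i] > 0 && ranks[i] <= k) hits++;
    }
    return (double)hits / (double)n_queries;
}
```

With ranks {1, 2, 0, 4} for four queries, the mean reciprocal rank is (1 + 1/2 + 0 + 1/4) / 4 = 0.4375, recall@1 is 0.25, and recall@5 is 0.75.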

7. Impact, Applications, and Future Directions

C benchmarks have substantially influenced research in vulnerability detection, software verification, code retrieval, automated repair, and code translation. Empirical results consistently demonstrate the inadequacy of function-level or lexically oriented benchmarks for evaluating advanced AI-based tools. The field is moving toward benchmarks that:

Mandate statement- or property-level annotation with real-world context (SecVulEval).
Demand compilable, context-rich, multi-file code artifacts (CLARC, CRUST-Bench).
Stress test for robustness to identifier renaming, semantic obfuscation, or code lowering (CLARC).
Provide open, extensible protocols for grounding empirical claims and facilitating cross-domain or cross-language research.

A plausible implication is that future C benchmarks will need to integrate hybrid symbolic–neural methods, domain-specific static analysis, full program contexts, and richer negative data to foster semantically aware, robust, and context-sensitive software engineering tools (Ahmed et al., 26 May 2025, Wang et al., 4 Mar 2026, Khatry et al., 21 Apr 2025).