SecVulEval: Fine-Grained Vulnerability Benchmark

Updated 22 June 2026
  • SecVulEval is a benchmark for evaluating how precisely LLMs localize vulnerabilities at the statement level in real-world C/C++ code.
  • It combines empirical benchmarking with automated adversarial analysis to overcome limitations of traditional function-level vulnerability datasets.
  • The framework employs a multi-agent pipeline and rigorous metrics to ensure explainable, context-aware detection and accurate vulnerability reasoning.

SecVulEval is a comprehensive benchmark framework for precise evaluation of vulnerability detection—especially in code produced or analyzed by LLMs—across real‐world C/C++ projects. The framework introduces statement‐level localization, leverages rich contextual information, and integrates methodologies from both empirical benchmarking and automated adversarial analysis. SecVulEval sets a rigorous standard for appraising the vulnerability detection capabilities of LLMs and related tools by demanding explainability, fine granularity, and context-aware threat modeling, and it addresses substantial prior limitations in evaluation datasets and protocols (Ahmed et al., 26 May 2025).

1. Evolution and Motivation

SecVulEval was created to overcome critical deficits in prior benchmarks. Historically, datasets for vulnerability detection focused on function-level binary labeling (vulnerable vs. not vulnerable), lacked sufficient variance in context, and exhibited high duplication and annotation inconsistencies. Such datasets did not reflect the real-world chains of reasoning required to identify the locus and cause of security flaws in large, mature C/C++ codebases (Ahmed et al., 26 May 2025). Additionally, these coarse-grained benchmarks masked whether a tool (or LLM) actually understood the root cause of a vulnerability or was merely learning to classify functions by generic patterns.

Function-level approaches also failed to supply the contextual cues—function arguments, external calls, types, macros, environmental assumptions—essential for distinguishing true from false positives in practical vulnerability assessment (Ahmed et al., 26 May 2025). SecVulEval, by design, fulfills the critical need for multifaceted, fine-grained benchmarks supporting rigorous quantification and diagnosis.

2. Dataset Construction and Structure

SecVulEval's central resource is a corpus of 25,440 curated C/C++ functions derived from 5,867 unique CVEs spanning 707 projects (Linux, Chrome, FFmpeg, etc.), harvested from 1999–2024 via the National Vulnerability Database (NVD). Unlike earlier function-only datasets, SecVulEval mandates a single fixing commit per CVE, enabling extraction of matched “before” (vulnerable) and “after” (fixed) function pairs (Ahmed et al., 26 May 2025).
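Given a matched before/after pair, the changed lines can be recovered with an ordinary line diff. The sketch below uses Python's standard `difflib` for this; the function name `changed_lines` and the toy C snippets are illustrative assumptions, not the authors' tooling, and a real pipeline would more likely diff at the commit level.

```python
import difflib

def changed_lines(before: str, after: str):
    """Return 1-based line numbers touched by the patch.

    The first list holds lines of the vulnerable version that were deleted
    or modified; the second holds lines of the fixed version that were
    added or modified. Hypothetical helper, not part of SecVulEval.
    """
    a = before.splitlines()
    b = after.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    before_hits, after_hits = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged region, nothing to label
        before_hits.extend(range(i1 + 1, i2 + 1))
        after_hits.extend(range(j1 + 1, j2 + 1))
    return before_hits, after_hits

# Toy pair (invented for illustration): an unbounded copy and its fix.
VULNERABLE = """void copy(char *dst, const char *src, size_t n) {
    strcpy(dst, src);
}"""
FIXED = """void copy(char *dst, const char *src, size_t n) {
    strncpy(dst, src, n - 1);
    dst[n - 1] = '\\0';
}"""

if __name__ == "__main__":
    print(changed_lines(VULNERABLE, FIXED))
```

A line-based diff treats a reformatted line as modified, so some normalization before diffing would probably be needed in practice.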

Vulnerable statements (additions, deletions, or modifications in the patch) are explicitly labeled, creating precise ground truth for localization. De-duplication, using function text normalization and MD5 hashing, removes the 3–19 % redundancy observed previously. Non-vulnerable functions total 14,442 and vulnerable functions 10,998. Functions range from 4 to more than 500 lines (median ≈ 44 LOC), exposing models to realistic code size and complexity.
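Normalization followed by MD5 hashing can be sketched in a few lines. The source does not specify the normalization rules, so the ones below (strip C comments, collapse whitespace) are assumptions, and the names `normalize`, `fingerprint`, and `deduplicate` are hypothetical.

```python
import hashlib
import re

def normalize(code: str) -> str:
    """Assumed normalization: drop C/C++ comments and collapse whitespace.

    The regexes are naive and would also strip comment-like text inside
    string literals; a parser-based approach would be more robust.
    """
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", " ", code)                    # line comments
    return re.sub(r"\s+", " ", code).strip()

def fingerprint(code: str) -> str:
    """MD5 digest of the normalized function text."""
    return hashlib.md5(normalize(code).encode("utf-8")).hexdigest()

def deduplicate(functions):
    """Keep the first occurrence of each distinct fingerprint, in order."""
    seen = set()
    unique = []
    for src in functions:
        digest = fingerprint(src)
        if digest not in seen:
            seen.add(digest)
            unique.append(src)
    return unique

SAMPLES = [
    "int add(int a, int b) {\n    return a + b; // sum\n}",
    "int add(int a, int b) {\n\treturn a + b;\n}",
    "int sub(int a, int b) { return a - b; }",
]

if __name__ == "__main__":
    print(len(deduplicate(SAMPLES)))
```

MD5 is used here only as a content fingerprint, where its cryptographic weaknesses do not matter.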

Every sample is enriched with up to five categories of contextual information (following Risse et al.’s taxonomy): function arguments, external function calls, type definitions, globals/macros, and execution environment cues. Automated extraction, using GPT-4.1 prompted with CVE and patch details, yields 82.98 % agreement with human gold-standard annotation (±9.68 % at 95 % CI) (Ahmed et al., 26 May 2025). Most misses involve peripheral symbols in very large functions, which suggests that the approach captures the key context in the majority of cases.

3. Benchmark Tasks, Metrics, and Methodology

SecVulEval formalizes vulnerability detection as a statement-level localization task. Given a complete function with contextual annotation, a candidate model must:

  1. Detect whether the function is vulnerable,
  2. Identify the exact vulnerable statements, and
  3. Provide minimal correct reasoning per statement.

This shifts the evaluation axis from mere binary classification to true explainable localization. The evaluation protocol employs Precision, Recall, and their harmonic mean F1, calculated as:

  • $\text{Precision} = \frac{TP}{TP + FP}$
  • $\text{Recall} = \frac{TP}{TP + FN}$
  • $F_1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

where a statement-level true positive requires both correct statement selection and a minimally correct rationale. Notably, if a model identifies the right function as vulnerable but supplies only a defective rationale for the statement, the prediction counts as a function-level TP but a statement-level FN, which emphasizes the necessity of fine-grained explainability (Ahmed et al., 26 May 2025).
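The scoring rule can be made concrete with a small sketch. It assumes each prediction carries a line number and a boolean saying whether its rationale was judged acceptable (how that judgment is made is outside this example). It also assumes that a correct line with a defective rationale counts only as a FN, while a predicted line absent from the ground truth counts as a FP. The names `Prediction`, `statement_counts`, and `prf` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Prediction:
    line: int           # 1-based line the model flags as vulnerable
    rationale_ok: bool  # assumed outcome of a separate rationale check

def statement_counts(gold_lines, predictions):
    """Return (TP, FP, FN) at statement level for one function."""
    gold = set(gold_lines)
    correct = {p.line for p in predictions if p.line in gold and p.rationale_ok}
    spurious = {p.line for p in predictions if p.line not in gold}
    tp = len(correct)
    fp = len(spurious)
    fn = len(gold - correct)  # includes right line with wrong rationale
    return tp, fp, fn

def prf(tp, fp, fn):
    """Precision, recall, and F1, defined as 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    gold = {10, 12}
    preds = [Prediction(10, True), Prediction(12, False), Prediction(20, True)]
    print(prf(*statement_counts(gold, preds)))
```

In this toy case the function would be a function-level TP (it is vulnerable and was flagged), yet line 12 still contributes a statement-level FN because its rationale failed the check.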

To probe LLM capabilities effectively, SecVulEval employs a multi-agent automated evaluation pipeline:

  • Normalization Agent: Tree-sitter parses and normalizes the function and emits its AST.
  • Planning Agent: LLM summarizes the function and formulates a checklist of possible security “red flags.”
  • Context Agent: LLM iteratively requests batches of external symbols necessary to resolve red flags, capped at three iterations.
  • Detection Agent: LLM determines vulnerability status, lists minimal vulnerable statements, and supplies single-line rationales.
  • Validation Agent: LLM double-checks Detection Agent’s output, allowing two back-and-forth correction passes (Ahmed et al., 26 May 2025).
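The control flow of the five agents, including the two iteration caps, can be sketched as a plain driver function. Every agent is passed in as a callable, so no particular LLM client or parser API is assumed. All names and signatures below are hypothetical, and the early-exit conditions are assumptions about how such a loop might terminate.

```python
MAX_CONTEXT_ROUNDS = 3     # cap on Context Agent iterations, per the description above
MAX_VALIDATION_PASSES = 2  # cap on Detection/Validation correction passes

def run_pipeline(function_src, normalize, plan, request_symbols,
                 resolve, detect, validate):
    """Drive the agents in order and return the final detection verdict.

    Assumed callable contracts (illustrative only):
      normalize(src) -> normalized representation
      plan(normalized) -> checklist of red flags
      request_symbols(normalized, checklist, context) -> list of symbol names
      resolve(names) -> dict mapping name to definition text
      detect(normalized, checklist, context, feedback) -> verdict
      validate(normalized, context, verdict) -> feedback, or None if accepted
    """
    normalized = normalize(function_src)
    checklist = plan(normalized)

    context = {}
    for _ in range(MAX_CONTEXT_ROUNDS):
        wanted = [name
                  for name in request_symbols(normalized, checklist, context)
                  if name not in context]
        if not wanted:
            break  # nothing new requested, stop early
        context.update(resolve(wanted))

    verdict = detect(normalized, checklist, context, None)
    for _ in range(MAX_VALIDATION_PASSES):
        feedback = validate(normalized, context, verdict)
        if feedback is None:
            break  # validator accepted the verdict
        verdict = detect(normalized, checklist, context, feedback)
    return verdict
```

Passing the agents as callables keeps the caps testable with stubs and leaves the choice of model and parser to the caller.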

Agents are restricted to JSON output so that responses can be scored automatically and stay on task. Prompts always enforce a defensive security analysis, forbidding exploit engineering.
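A JSON-only contract makes it straightforward to reject malformed agent responses before scoring. The source does not give the schema, so the field names `is_vulnerable`, `statements`, `line`, and `rationale` below are assumptions, as is the helper name `parse_detection`.

```python
import json

def parse_detection(raw: str) -> dict:
    """Parse and validate a Detection Agent response under an assumed schema.

    Raises ValueError (json.JSONDecodeError is a subclass) on any violation.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    if not isinstance(data.get("is_vulnerable"), bool):
        raise ValueError("'is_vulnerable' must be a boolean")
    statements = data.get("statements", [])
    if not isinstance(statements, list):
        raise ValueError("'statements' must be a list")
    for item in statements:
        if not isinstance(item, dict):
            raise ValueError("each statement must be an object")
        line = item.get("line")
        # bool is a subclass of int in Python, so exclude it explicitly
        if not isinstance(line, int) or isinstance(line, bool) or line < 1:
            raise ValueError("'line' must be a positive integer")
        rationale = item.get("rationale")
        if not isinstance(rationale, str) or not rationale.strip():
            raise ValueError("'rationale' must be a non-empty string")
        if "\n" in rationale.strip():
            raise ValueError("'rationale' must be a single line")
    if not data["is_vulnerable"] and statements:
        raise ValueError("non-vulnerable verdict must not list statements")
    return {"is_vulnerable": data["is_vulnerable"], "statements": statements}

if __name__ == "__main__":
    sample = '{"is_vulnerable": true, "statements": [{"line": 2, "rationale": "unbounded copy into dst"}]}'
    print(parse_detection(sample))
```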

4. Experimental Findings and Comparative Analysis

Experiments on the 25,440-sample dataset, including a manually validated subset of 300 functions, reveal that none of the tested LLMs is yet production-ready for statement-level vulnerability localization in real-world C/C++ (Ahmed et al., 26 May 2025). The best-performing model (Claude-3.7-Sonnet) achieved:

  • Function-level: Precision ≈ 41.9 %, Recall ≈ 75.6 %, F1 ≈ 53.9 %
  • Statement-level: Precision ≈ 15.4 %, Recall ≈ 53.2 %, F1 ≈ 23.8 %
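As a sanity check, the harmonic-mean definition of $F_1$ can be applied to the rounded precision and recall values listed above. This is a worked calculation on the reported figures only, not a reproduction of the experiments.

```latex
\begin{align*}
F_1^{\text{function}}  &= 2 \cdot \frac{0.419 \times 0.756}{0.419 + 0.756}
                        = \frac{0.633528}{1.175} \approx 0.539 \\
F_1^{\text{statement}} &= 2 \cdot \frac{0.154 \times 0.532}{0.154 + 0.532}
                        = \frac{0.163856}{0.686} \approx 0.239
\end{align*}
```

The function-level result matches the reported 53.9 %. The statement-level result lands within 0.1 percentage points of the reported 23.8 %, a gap that is plausibly explained by rounding in the published precision and recall.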

GPT-4.1 scored only 22.4 % F1 at statement-level; open-source LLMs lagged further (7–16 % F1). Noteworthy dynamics include:

  • Closed-source LLMs: High recall (aggressive, many FPs).
  • Open-source LLMs: High precision, low recall (overly cautious).

Analysis highlights critical failure patterns: over-detection of null-dereferences (CWE-476), confusion over use-after-free (CWE-416), unwarranted integer overflow flags (CWE-190), and confusion between sources and sinks. All models performed better on short functions. The sharp drop in precision, recall, and F1 when they are measured strictly at the statement level demonstrates that function-level binary outcomes do not reliably indicate a system’s understanding of vulnerabilities.

5. Relation to Complementary Benchmarks and Broader Methodologies

SecVulEval is situated within a broader ecosystem involving code generation evaluation (e.g., CWEval), graph neural network classification (SEGNN), and evaluation of security tools (Peng et al., 14 Jan 2025, Ahmed et al., 2023, Valenza et al., 2020):

  • CWEval offers a combined framework assessing both functional correctness and security of generated code via outcome-driven oracles applied to challenging, well-specified coding tasks (multi-language, multi-CWE, functional and security oracles). CWEval introduces metrics—Functional Correctness Rate (FCR), Vulnerability Detection Rate (VDR), and Combined Security-Functional Score (CSFS)—that could be adopted for more general SecVulEval designs (Peng et al., 14 Jan 2025).
  • SEGNN and CVEFGE represent early attempts at more sophisticated, graph-based vulnerability datasets, providing binary labels for functions and leveraging program structure via control-flow graphs, but lacking the fine-grained, statement-level annotation and context required for SecVulEval’s objectives (Ahmed et al., 2023).
  • RevOK/Scanner Evaluation demonstrates the importance of outcome-driven, adversarial evaluation for tools beyond code itself: evaluating the robustness of vulnerability scanners against client-side XSS attacks rooted in the scanner’s processing of untrusted responses (Valenza et al., 2020).

These benchmarks converge on several principles recommended for SecVulEval: dynamic oracles, high-quality and context-intensive specifications, fine-grained labeling with rationale, and side-by-side reporting of complementary metrics (FCR, VDR, CSFS, F1, etc.).

6. Limitations and Future Directions

SecVulEval’s main constraints include:

  • Language Scope: Limited to C/C++; gaps remain for Rust, Go, and other critical systems languages (Ahmed et al., 26 May 2025).
  • Commit-level Extraction: Multi-commit/complex patches are excluded; some real-world vulnerabilities are fixed via extended, distributed refactoring.
  • Context Extraction: Automated GPT-4.1 based context extraction, though accurate in aggregate, may omit key symbols for complex functions; hybrid static analysis and human refinement are potential avenues for enhancement.
  • Manual Evaluation: Statement-level scoring relies on 300 manually validated functions; scaling this through larger, crowdsourced, or semi-automated auditing would strengthen the benchmark.
  • Sequence Reasoning: Models struggle with vulnerabilities spanning complex control/data-flow chains or demanding deep symbolic analysis.

Proposed future enhancements include extending to more languages, integrating full repository-level call graphs, cross-project validation, integration with symbolic or taint analysis, and the introduction of vulnerability-repair or code “patch suggestion” tasks at statement granularity (Ahmed et al., 26 May 2025).

7. Significance for Security and LLM Research

SecVulEval’s rigorously constructed dataset and multi-agent pipeline set a new high bar for measuring LLMs and related tools in terms of true security vulnerability understanding, explainability, and localization. The framework quantifies fundamental performance limitations in state-of-the-art LLMs: even models with high global accuracy fall short on fine-grained vulnerability localization and rationalization, especially as code length and structural complexity increase.

By focusing on realistic, deduplicated, context-rich, and statement-localized ground truths, SecVulEval directs the research agenda toward models capable of detailed, trustworthy security reasoning—beyond pattern matching or summarization. It thus provides an indispensable resource for comparative benchmarking, ablation studies, and longitudinal progress tracking in software security and automated code analysis (Ahmed et al., 26 May 2025).
