
SastBench: Agent-Agnostic SAST Triage Benchmark

Updated 7 January 2026
  • SastBench is an agent-agnostic benchmark for automated triage of static-analysis findings: it evaluates how well autonomous agents distinguish true vulnerabilities from false alerts.
  • The dataset combines human-validated true positives from recent CVEs with semgrep-generated false positives, simulating an 8:1 noise ratio across multiple programming languages.
  • Its evaluation framework supports diverse LLM-driven agent architectures and measures performance using precision, recall, F1, and MCC to guide improvements in automated SAST triage.

SastBench is an agent-agnostic benchmark developed to rigorously evaluate the ability of autonomous, LLM-driven agents to triage findings from Static Application Security Testing (SAST) tools. In contemporary defensive cybersecurity, lightweight SAST tools are indispensable for detecting potential software vulnerabilities but suffer from a prohibitively high false positive rate—studies indicate that over 90% of SAST-reported alerts are ultimately non-exploitable—thereby necessitating substantial manual triage by security analysts. SastBench addresses the lack of realistic, public benchmarks for end-to-end agentic evaluation of SAST triage, capturing both the data distribution encountered in real-world SAST workflows and the agent-driven, tool-integrated methodologies increasingly employed for automation (Feiglin et al., 6 Jan 2026).

1. Triage Problem Definition and Motivation

SastBench formalizes the SAST triage task as follows: given a specific commit of a code repository and a collection of flagged code instances grouped by CWE (Common Weakness Enumeration), an agent must decide, for each group, whether it constitutes a real vulnerability (true positive) or represents an erroneous (false positive) alert. The motivation derives from two industry-critical observations: (i) SAST tool alerts produce an overwhelming volume of noise that burdens practitioners and (ii) the absence of benchmarks reflecting realistic SAST false positive distributions and supporting end-to-end agent workflows has impeded progress in automated triage.
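The task interface implied by this formalization can be sketched in Python. The type and field names below are illustrative assumptions, not the benchmark's actual schema:

```python
from dataclasses import dataclass
from typing import List, Literal

@dataclass
class FlaggedInstance:
    """One SAST-flagged code location (fields are illustrative)."""
    file_path: str
    start_line: int
    end_line: int
    context: str  # surrounding source code

@dataclass
class TriageGroup:
    """All flagged instances sharing one CWE at a given commit."""
    cwe_id: str  # e.g. "CWE-89"
    instances: List[FlaggedInstance]

Verdict = Literal["true_positive", "false_positive"]

def triage(commit_sha: str, group: TriageGroup) -> Verdict:
    """An agent must implement this decision; the body here is a stub."""
    raise NotImplementedError
```

An agent is free to implement `triage` however it likes (single prompt, tool-augmented loop, etc.); the benchmark only scores the verdicts.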

2. Dataset Construction and Characteristics

SastBench-v1 integrates both true positive and false positive findings to simulate the authentic SAST triage environment:

  • True Positives (TP): Derived from recent CVEs (Common Vulnerabilities and Exposures) from the NVD that reference publicly available code-fix commits on GitHub. Each CVE is annotated with a human-validated CWE label, and only CVEs published after a configurable knowledge cutoff (e.g., February 2025) are retained to preclude memorization by test-time agents.
  • False Positives (FP): Generated by executing the open-source semgrep tool on commit-preceding code states. Any semgrep finding (associated with a CWE) not matching the ground-truth CWE of the CVE is considered negative, with additional heuristics excluding findings in the same function as a validated vulnerability to reduce information leakage.

The SastBench-v1 corpus encompasses 2,737 samples (299 TP, 2,438 FP), reflecting a false-to-true ratio of 8.15:1, and spans 38 programming languages (notably PHP, JavaScript/TypeScript, Python) and 139 unique CWEs. The distribution of languages and weaknesses aligns closely with industrial SAST tool output, including widespread web and memory-safety vulnerabilities (e.g., CWE-79: XSS, CWE-89: SQL injection).
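The negative-labeling rule described above can be sketched as follows; the function-level exclusion granularity and all names here are assumptions for illustration, not the authors' exact implementation:

```python
def label_semgrep_finding(finding_cwe, finding_function,
                          ground_truth_cwe, vulnerable_functions):
    """Label a semgrep finding on the pre-fix commit.

    A finding whose CWE does not match the CVE's ground-truth CWE is
    treated as a negative (false positive), unless it lies in the same
    function as a validated vulnerability, in which case it is excluded
    to reduce information leakage.
    """
    if finding_cwe == ground_truth_cwe:
        return "excluded"  # may overlap the real vulnerability
    if finding_function in vulnerable_functions:
        return "excluded"  # same-function findings dropped
    return "false_positive"
```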

3. Agent-Agnostic Evaluation Framework

SastBench’s architecture is inspired by recent “agentic” benchmarks (e.g., SWE-Bench, Terminal-Bench) and is agnostic to agent design, enabling fair comparison across diverse auto-triage methods. The evaluation pipeline is as follows:

  • Agents are submitted as Dockerized services, exposing a REST endpoint (/analyze) for interaction.
  • For each target repository/commit, the framework checks out the code, passes the agent a JSON-formatted list of instances (including file paths, line ranges, contextual code, and CWE labels).
  • The agent returns binary triage decisions—"true_positive" or "false_positive"—for each instance.
  • Agents may use arbitrary supporting tools or prompts and may feature workflows with a commit-level preprocessing stage as well as instance-level reasoning.

This design supports various agent architectures, including single-prompt LLMs, ReAct-style reasoning loops, tool-augmented frameworks, and domain-generalist models, while ensuring consistent experimental conditions.
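The request body POSTed to an agent's /analyze endpoint might be assembled as below. The source specifies only that instances carry file paths, line ranges, contextual code, and CWE labels; the exact key names are assumptions:

```python
import json

def build_analyze_request(commit_sha, instances):
    """Construct a JSON body for an agent's /analyze endpoint.

    `instances` is a list of dicts with (assumed) keys:
    file, start, end, context, cwe.
    """
    return json.dumps({
        "commit": commit_sha,
        "instances": [
            {
                "file": inst["file"],
                "lines": [inst["start"], inst["end"]],
                "context": inst["context"],
                "cwe": inst["cwe"],
            }
            for inst in instances
        ],
    })
```

The agent's response would then map each instance to "true_positive" or "false_positive", as described above.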

4. Evaluation Metrics and Methodology

Triage performance in SastBench is quantified via standard classification metrics, adapted to the heavily imbalanced dataset:

With TP, FP, TN, FN denoting the confusion-matrix counts:

  • Precision: $P = \frac{TP}{TP + FP}$; fraction of predicted positives that are actual vulnerabilities.
  • Recall: $R = \frac{TP}{TP + FN}$; fraction of actual vulnerabilities that are flagged.
  • F₁ score: $F_1 = \frac{2PR}{P + R}$; harmonic mean balancing precision and recall.
  • F₂ score: $F_2 = \frac{5PR}{4P + R}$; weights recall more heavily, reflecting the cost of missed vulnerabilities.
  • Matthews correlation coefficient: $MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$; robust under class imbalance and used as the primary indicator.

Accuracy is also reported but receives less emphasis due to the high FP:TP ratio in the dataset, which would otherwise obscure meaningful differences.
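These metrics follow directly from the confusion counts; a minimal sketch (with zero-division guards, an assumption about how edge cases are handled):

```python
import math

def triage_metrics(tp, fp, tn, fn):
    """Compute precision, recall, F1, F2, and MCC from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    f2 = (5 * precision * recall / (4 * precision + recall)
          if 4 * precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "f2": f2, "mcc": mcc}
```

On a heavily imbalanced set like SastBench's 8.15:1 negatives, MCC penalizes degenerate always-positive or always-negative predictors that accuracy would reward.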

5. Experimental Results and Workflow Variants

SastBench was employed to assess a spectrum of LLM-based agents—including Gemini 2.5 Flash/Pro, Claude Sonnet 4.5, Qwen3 Coder 480B, GPT-OSS 120B, DeepSeek-R1, and Llama 4 Maverick 17B—across several workflow paradigms:

  • No-Tools Baseline: Chain-of-Thought style prompt on code snippets.
  • Simple ReAct Agent: Iterative Reason-and-Act using file reading and grep tools.
  • Improved Prompt ReAct Agent: Incorporates security-expert heuristics into the reasoning loop.
  • Generalist Agents: Architectures like mini-SWE-Agent and OpenHands, leveraging their own toolchains.

Notable findings include:

  • Gemini 2.5 Pro with improved-prompt ReAct achieved MCC = 0.148, precision = 0.169, recall = 0.582, F₁ = 0.262.
  • Claude Sonnet 4.5 with the improved prompt emphasized recall (0.722) at somewhat lower precision (0.140), MCC = 0.110.
  • Simple ReAct workflows with weaker LLMs (e.g., Llama 4 Maverick 17B) frequently resulted in negative MCC, indicating performance worse than random on this hard negative set.
  • Gemini 2.5 Pro, even in the no-tools condition, demonstrated strong recall, rivaling or exceeding some ReAct agents and indicating inherent code reasoning capability.
  • Precision-recall plots illustrate a Pareto frontier: stronger LLMs generally dominate across both metrics, but practitioners may choose operating points that emphasize recall (to minimize missed vulnerabilities) or precision (when analyst bandwidth constrains triage throughput).

No current model excels at maximizing both precision and recall, and performance is highly sensitive to prompt, workflow, and model selection.

6. Dataset Analysis and Practical Implications

Analysis of the SastBench corpus indicates that FPs generated via semgrep and subjected to rigorous filtering are challenging and resistant to superficial artifacts. Empirically, simple embedding-based classifiers show only marginal ability to separate semgrep-originated negatives from CVE-originated positives, confirming that the benchmark cannot be solved by shortcut exploitation and demands genuine code reasoning. The cross-language and cross-CWE coverage supports evaluation of generalization along both dimensions.

A crucial implication is that progress on SastBench tracks genuine advances in agentic, reasoning-driven triage methods, not superficial heuristics or dataset-artifact exploitation. The authors caution against overfitting to benchmark idiosyncrasies (e.g., simply rerunning semgrep), and emphasize reasoning strategies whose conclusions would be defensible to a human analyst.

7. Limitations and Future Directions

SastBench is positioned as a reproducible, extensible testbed for automated SAST triage, not a conclusive measure of "triage intelligence." Limitations include residual risk of dataset artifacts, evolving SAST tool behavior, and the potential for future LLM advances to change the relative difficulty of current benchmarks. The rolling-update design, version-tagged and periodically refreshed, enables longitudinal tracking. Prospective enhancements include increased dataset scale, richer language and vulnerability class coverage, integration of dynamic analysis signals, and support for hybrid human-in-the-loop architectures.

In summary, SastBench provides a high-fidelity, agent-agnostic evaluation platform unifying real-world vulnerabilities and SAST-generated alert noise, supporting the rigorous measurement of agentic triage workflows under realistic industrial constraints and guiding future research toward scalable, trustworthy automation of security triage (Feiglin et al., 6 Jan 2026).
