SAST-Judge: Standardized Evaluation Framework

Updated 1 April 2026

SAST-Judge is a formal evaluation framework that standardizes SAST outputs using unified taxonomies and normalized reporting schemas.
It benchmarks tools with both synthetic and real-world datasets, employing rigorous metrics such as precision, recall, F₁ score, and MCC.
The framework integrates hybrid LLM methods and causal inference techniques to enhance tool validation and address judicial IV design challenges.

SAST-Judge refers to a class of formal evaluation frameworks and methodologies for benchmarking, validating, and harmonizing the outputs of Static Application Security Testing (SAST) systems, including classic analyzers, LLM-based agents, and hybrid pipelines. The name is also used for a specification test of instrumental variable (IV) identification strategies in judge designs. The concept thus spans software security, program analysis, and causal inference, where rigor and comparability across tools, work streams, or experimental arms are critical. Its canonical instantiations emphasize unified taxonomies, standardized reporting, robust metrics, and sharp statistical testing.

1. Unified Taxonomy and Output Normalization

A primary challenge in SAST evaluation is the heterogeneity of vulnerability type definitions and reporting schemas across tools. The SAST-Judge blueprint, exemplified by the VulsTotal platform for Android, mandates manual construction of a unified vulnerability taxonomy. This is derived through comprehensive enumeration and synthesis of alert identifiers, detection logic descriptions, and configuration artifacts across candidate SAST systems. Human analysts review ambiguous cases and resolve granularity mismatches to establish a taxonomy reflecting root-cause vulnerability groupings aligned with industry standards, such as the OWASP Mobile Top 10 (Zhu et al., 2024).

A representative taxonomy as implemented contains 67 leaf vulnerability types grouped under five high-level categories: sensitive data exposure, insufficient encryption, security misconfiguration, insecure code execution, and insecure network configuration. Each finding from any SAST tool is mapped to this taxonomy and reported in a standardized, machine-readable format containing source file, code context, tool provenance, and taxonomy ID, enabling robust cross-tool deduplication and type coverage analysis.

Category	Example Types
Sensitive Data Exposure	Logging Data Exposure, Hardcoded Sensitive Data
Insufficient Encryption	Improper Symmetric Encryption, Use Insecure Random
Security Misconfiguration	Insecure Component SDK Usage, Faulty Exported Component
Insecure Code Execution	WebView Code Execution, Dynamic Class Loading
Insecure Network Configuration	Absence of TLS, Hardcoded CA

Unification of tool outputs through engineered parser scripts and mapping databases is essential for fair quantitative benchmarking.
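The normalization and deduplication step described in this section can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration: the tool names, alert identifiers, taxonomy IDs, and the `Finding`, `normalize`, and `deduplicate` helpers are hypothetical and are not taken from VulsTotal.

```python
from dataclasses import dataclass

# Hypothetical mapping from (tool, tool-specific alert ID) to a unified taxonomy ID.
# Tool names, alert IDs, and taxonomy IDs are illustrative placeholders.
TAXONOMY_MAP = {
    ("tool_a", "LOG_LEAK"): "SDE-01",
    ("tool_b", "android.logging.sensitive"): "SDE-01",
    ("tool_a", "WEAK_RNG"): "IE-02",
}


@dataclass(frozen=True)
class Finding:
    """Normalized record with the four fields named in the text."""
    source_file: str
    code_context: str
    tool: str
    taxonomy_id: str


def normalize(tool, alert_id, source_file, code_context):
    """Map a raw alert to a normalized Finding, or None if the alert is unmapped."""
    taxonomy_id = TAXONOMY_MAP.get((tool, alert_id))
    if taxonomy_id is None:
        return None
    return Finding(source_file, code_context, tool, taxonomy_id)


def deduplicate(findings):
    """Group findings by (file, taxonomy ID) and record which tools reported each."""
    merged = {}
    for f in findings:
        merged.setdefault((f.source_file, f.taxonomy_id), set()).add(f.tool)
    return merged
```

Keying deduplication on the file and taxonomy ID is only one possible choice; a real platform might also compare code locations before merging two alerts.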

2. Benchmark Construction and Evaluation Metrics

Accurate SAST comparison requires both synthetic and real-world vulnerability benchmarks. SAST-Judge platforms integrate synthetic suites (e.g., GHERA, MSTG, PIVAA) with curated CVE-based datasets derived from exhaustive manual labeling of publicly reported vulnerabilities and acquisition of the corresponding real-world artifacts (e.g., APKs in Android studies) (Zhu et al., 2024). This dual approach reveals how tool performance differs between pattern-driven cases and semantically intricate ones. Benchmarks are constructed to maximize coverage of the normalized taxonomy and to reflect the distribution of vulnerabilities in contemporary codebases, which enables comprehensive recall and capability testing.

Evaluation metrics are standardized:

Precision: $TP/(TP+FP)$
Recall: $TP/(TP+FN)$
F₁ Score: $2 \cdot (\text{Precision} \cdot \text{Recall}) / (\text{Precision} + \text{Recall})$
B_Recall: $\mathrm{B\_Recall} = \#\mathrm{CorrectlyIdentifiedVulns} / \#\mathrm{AllKnownVulnsInBenchmark}$ (for benchmarks with only positive instances)
Time Performance: mean analysis runtime per target, failure/event counts
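As a worked illustration, the count-based metrics above can be computed as follows. This is a minimal sketch: the function names are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, not one stated in the source.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 when nothing was reported (assumed convention)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 when there are no positives (assumed convention)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def b_recall(correctly_identified, all_known):
    """Recall on a benchmark that contains only positive instances."""
    return correctly_identified / all_known if all_known else 0.0
```

For example, a tool with 8 true positives, 2 false positives, and 8 false negatives has precision 0.8, recall 0.5, and an F₁ score of about 0.615.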

For SAST triage, additional metrics include Matthews Correlation Coefficient (MCC), F₂ (recall-weighted), accuracy, and explicit triage cost models accounting for human labor and remediation risk, with cost functions prioritized according to false negative and false positive operational impacts (Feiglin et al., 6 Jan 2026).
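The triage metrics can be sketched in the same way. MCC and F_β below follow their standard definitions. The linear `triage_cost` function and its default weights are assumptions made for illustration, since the source does not give the form of its cost model.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; 0.0 when undefined (assumed convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta=2 weights recall more heavily than precision."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


def triage_cost(fp, fn, cost_fp=1.0, cost_fn=20.0):
    """Hypothetical linear cost: analyst effort per FP plus remediation risk per FN."""
    return cost_fp * fp + cost_fn * fn
```

With these placeholder weights, one missed vulnerability costs as much as twenty spurious alerts. In practice the weights would be set from the operational impact of each error type.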

3. SAST-Judge Architectural Principles

The SAST-Judge architecture comprises modular managers for tool invocation, report normalization, taxonomy versioning, benchmark orchestration, metrics computation, and interactive reporting. A minimal batch evaluation comprises:

Unified CLI wrappers and tool managers
Output normalization via per-tool parsers
Taxonomy mapping stored as versioned JSON/DB
Hybrid benchmark managers handling both synthetic and real-world datasets
Evaluators executing direct label-to-finding joins for metric calculation
Dashboards/tables reporting coverage, recall, and time performance matrices

In pseudocode, a batch experiment is a deterministic outer loop that assigns each tool to each artifact, normalizes the output, stores the unified findings, and then evaluates them against each benchmark in turn (Zhu et al., 2024).
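A batch experiment of this shape might look like the Python sketch below. It is not the pseudocode from Zhu et al.: the `run_batch` function, its argument layout, and the choice to count every labeled vulnerability as a false negative when a tool crashes are assumptions made for illustration.

```python
def run_batch(tools, artifacts, labels):
    """Run every tool on every artifact and join findings with ground-truth labels.

    tools:     dict of tool name -> callable(artifact) returning normalized taxonomy IDs
    artifacts: iterable of artifact identifiers
    labels:    dict of artifact -> set of ground-truth taxonomy IDs
    """
    artifacts = sorted(artifacts)
    store = {}
    for tool_name in sorted(tools):  # sorted order keeps the assignment deterministic
        for artifact in artifacts:
            try:
                store[(tool_name, artifact)] = set(tools[tool_name](artifact))
            except Exception:
                # Record the failure explicitly instead of storing an empty result.
                store[(tool_name, artifact)] = None

    results = {}
    for tool_name in sorted(tools):
        tp = fp = fn = failures = 0
        for artifact in artifacts:
            found = store[(tool_name, artifact)]
            truth = labels.get(artifact, set())
            if found is None:
                failures += 1
                fn += len(truth)  # assumed: a crashed run misses every labeled issue
                continue
            tp += len(found & truth)
            fp += len(found - truth)
            fn += len(truth - found)
        results[tool_name] = {"tp": tp, "fp": fp, "fn": fn, "failures": failures}
    return results
```

Tracking failures separately from empty results keeps parser or decompiler crashes visible in the final report.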

4. SAST-Judge in Triage and LLM Hybrid Systems

Recent work positions SAST-Judge as a hybrid feedback pipeline that pairs high-precision SAST tool invocation with secondary LLM-based agents for type inference, CWE mapping, or triage of ambiguous findings (Adnan et al., 4 Jan 2026; Feiglin et al., 6 Jan 2026). Benchmarks such as SastBench operationalize this as a two-stage filter:

Lightweight heuristics isolate trivial false positives via classic SAST.
Agentic LLMs, invoked with evidence-focused ReAct prompts and security-anchored process instructions, adjudicate hard negatives.
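The two-stage filter can be sketched as follows. The heuristic, the finding fields, and the `triage` function are hypothetical, and the `adjudicate` callable is a stand-in for an LLM agent; no real model API is used or implied.

```python
def in_test_code(finding):
    """Hypothetical heuristic: treat alerts inside test directories as trivial false positives."""
    return "/test/" in finding["path"]


def triage(findings, heuristics, adjudicate):
    """Stage 1: cheap heuristics. Stage 2: agent adjudication of whatever remains.

    findings:   dict of finding ID -> finding record
    heuristics: list of callables returning True for a trivial false positive
    adjudicate: callable standing in for the LLM agent; returns a verdict string
    """
    verdicts = {}
    for finding_id, finding in findings.items():
        if any(rule(finding) for rule in heuristics):
            verdicts[finding_id] = ("false_positive", "heuristic")
        else:
            verdicts[finding_id] = (adjudicate(finding), "agent")
    return verdicts
```

Recording which stage produced each verdict makes it possible to measure how many findings never reach the more expensive agent.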

Empirical results demonstrate that domain-specific prompts and fallback strategies (e.g., falling back to syntactic grep when AST tools fail) significantly improve F₁ and MCC. Because the FP/TP imbalance is extreme (8.15:1 in SastBench), performance is assessed primarily with MCC and cost-weighted metrics, with a recommendation to optimize for F₂ where recall dominates (Feiglin et al., 6 Jan 2026).

Agent	Accuracy	Precision	Recall	MCC
Gemini 2.5 Pro (ReAct)	0.64	0.17	0.58	0.148
Claude 4.5 (ReAct)	0.48	0.14	0.72	0.110
Llama 4 Maverick	0.68	0.10	0.23	-0.020

Hybrid approaches are recommended to exploit SAST’s high precision where available and compensate recall via LLM inference or voting ensembles (Adnan et al., 4 Jan 2026).

5. Hierarchical, CWE-Aware Penalty Evaluation

The ALPHA metric distinguishes SAST-Judge from other recent evaluation proposals by applying a penalty function that is aware of the CWE hierarchy (Adnan et al., 4 Jan 2026). Prediction errors are classified as over-generalization, lateral mismatch, or over-specification:

Penalty Function:

$P(c_{\text{pred}},c_{\text{true}}) = d(c_{\text{pred}},c_{\text{true}}) \times \alpha(c_{\text{pred}},c_{\text{true}})$

with $\alpha_{\text{up}}=2.0$ (over-generalization), $\alpha_{\text{lateral}}=1.8$, and $\alpha_{\text{down}}$ adapting to ground-truth subtree depth.

Aggregate Score:

$\mathrm{ALPHA} = \frac{1}{N}\sum_{i=1}^{N} P_i$

Lower ALPHA indicates better CWE-specificity and feedback utility.
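A small sketch may help make the penalty concrete. It rests on several assumptions that the source does not confirm: $d$ is taken to be the number of edges between the two nodes in a tree-shaped hierarchy, the node names are placeholders rather than real CWE IDs, and $\alpha_{\text{down}}$ is fixed at a placeholder constant because the text gives no formula for its dependence on subtree depth.

```python
# Toy hierarchy as a child -> parent map; node names are placeholders, not real CWE IDs.
PARENT = {"A": "root", "B": "root", "A1": "A", "A2": "A", "A1a": "A1"}


def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def penalty(pred, true, parent, alpha_up=2.0, alpha_lateral=1.8, alpha_down=1.0):
    """P = d * alpha, with d assumed to be the edge distance in the hierarchy.

    alpha_down=1.0 is a placeholder; the source says it adapts to subtree depth.
    """
    if pred == true:
        return 0.0
    pred_path = path_to_root(pred, parent)
    true_path = path_to_root(true, parent)
    if pred in true_path:  # prediction is an ancestor: over-generalization
        return true_path.index(pred) * alpha_up
    if true in pred_path:  # prediction is a descendant: over-specification
        return pred_path.index(true) * alpha_down
    # Lateral mismatch: distance measured through the lowest common ancestor.
    true_index = {node: i for i, node in enumerate(true_path)}
    for i, node in enumerate(pred_path):
        if node in true_index:
            return (i + true_index[node]) * alpha_lateral
    raise ValueError("nodes share no common ancestor")


def alpha_score(pairs, parent, **alphas):
    """Mean penalty over (predicted, true) pairs; lower is better."""
    return sum(penalty(p, t, parent, **alphas) for p, t in pairs) / len(pairs)
```

In this toy hierarchy, predicting "A" when the truth is "A1a" is two edges too general and costs 2 × 2.0 = 4.0, while predicting the sibling "A2" for "A1" costs 2 × 1.8 = 3.6.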

Empirical evaluation shows that LLMs systematically achieve lower ALPHA than SAST on human-annotated data, while SAST outpaces LLMs in precision when detections occur. Consistency metrics (perfect and majority agreement across LLM runs) serve as safeguards against brittle advice in iterative developer feedback systems. Dual-head LLM models, which merge token-level and classification heads under joint loss functions weighted by the normalized ALPHA penalty, have been proposed to further enhance CWE granularity and prediction stability.
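The consistency metrics could be computed along the following lines. The definitions are assumptions, since the source does not spell them out: perfect agreement means every run returns the same label for a sample, and majority agreement means one label is returned by more than half of the runs.

```python
from collections import Counter


def agreement_rates(runs_per_sample):
    """Return (perfect_rate, majority_rate) over samples.

    runs_per_sample: non-empty list of non-empty lists, one list of predicted
    labels per sample, collected across repeated LLM runs.
    """
    if not runs_per_sample:
        raise ValueError("at least one sample is required")
    perfect = majority = 0
    for labels in runs_per_sample:
        top_count = Counter(labels).most_common(1)[0][1]
        perfect += top_count == len(labels)      # all runs agree
        majority += top_count > len(labels) / 2  # strict majority agrees
    n = len(runs_per_sample)
    return perfect / n, majority / n
```

A sample with three different answers across three runs counts toward neither rate, which flags it as unstable advice.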

6. SAST-Judge in Causal Inference and IV Validation

In the econometric literature, SAST-Judge also denotes a "sharp" specification test for the judge leniency IV design (Coulibaly et al., 2024). Here, SAST-Judge leverages the observable distributional implications (as inequalities over treated/untreated expectations conditional on instrument propensity) to assess the validity of random assignment, exclusion, and monotonicity, using test statistics based on gridwise unconditional moments. Under violations, identification can be salvaged via partial monotonicity and exclusion, recovering local marginal treatment effects. Simulation studies show the sharp SAST-Judge test outperforms non-sharp alternatives and empirically detects subtle design violations in real-world judicial data.

7. Adoption, Operationalization, and Best Practices

Successful SAST-Judge implementation necessitates:

Investment in unified taxonomies and reporting normalization
Continuous benchmark updates with recently discovered CVEs for realism and data-leakage avoidance
Hybridization with LLMs and agentic workflows supporting actionable, single-label predictions enforced via prompting or post-processing
Metric-driven evaluation, emphasizing operational cost metrics (e.g., triage cost) and MCC in imbalanced settings
Integration as policy gates in CI/CD pipelines, dashboarding of high-risk counts, false positive rates, and average remediation delays
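As one illustration of the last item, a CI/CD policy gate can be reduced to a threshold check over a summary of the latest scan. The field names, thresholds, and `policy_gate` function below are hypothetical placeholders, not part of any described SAST-Judge implementation.

```python
def policy_gate(summary, max_high_risk=0, max_fp_rate=0.5):
    """Return a list of violation messages; an empty list means the gate passes.

    summary: dict with hypothetical keys "high_risk_count" and "false_positive_rate".
    """
    violations = []
    if summary["high_risk_count"] > max_high_risk:
        violations.append(
            f"high-risk findings: {summary['high_risk_count']} > {max_high_risk}"
        )
    if summary["false_positive_rate"] > max_fp_rate:
        violations.append(
            f"false positive rate: {summary['false_positive_rate']:.2f} > {max_fp_rate:.2f}"
        )
    return violations
```

A pipeline step could fail the build whenever the returned list is non-empty and print the messages for the developer.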

Potential pitfalls include underestimating the labor for taxonomy alignment; oversampling synthetic, non-representative benchmarks; or failure to monitor model and tool failure modes (e.g., parser or decompiler errors leading to silent FNs) (Zhu et al., 2024, Feiglin et al., 6 Jan 2026).

By adhering to SAST-Judge principles, evaluation frameworks can support reproducible, fair, and extensible benchmarking of security analysis systems across both program analysis and econometric applications.