
JUDGEJS: Benchmarking LLM Vulnerability Detection

Updated 8 December 2025
  • JUDGEJS is an automated evaluation framework for JavaScript vulnerability detection that standardizes LLM benchmarking by aligning predictions with rigorous ground-truth criteria.
  • It integrates modules for model invocation, output normalization, label alignment, and metrics computation to assess 218 CWE types across multiple dataset variants.
  • Empirical results reveal that while project-level F₁ scores are moderate, function-level detection and robustness under adversarial conditions remain significant challenges.

JUDGEJS is an automated evaluation framework for the comprehensive benchmarking and assessment of LLMs in JavaScript vulnerability detection. Situated within the SecJS vulnerability detection pipeline, JUDGEJS standardizes evaluation workflows, aligns model predictions against rigorous ground-truth criteria, and produces multidimensional metrics relevant to both research and industrial reliability requirements. The framework enforces methodological principles specifically designed to mitigate shortcomings of prior benchmarking approaches: (1) coverage of 218 distinct CWE types and multiple dataset variants, (2) prevention of underestimation through semantic equivalence classes and denoised splits, and (3) controls against overestimation by requiring holistic repository-level reasoning. JUDGEJS is also cited as a candidate for integrating advanced trichotomous reasoning pipelines relevant to legal judgment prediction tasks (Fei et al., 1 Dec 2025, Zhang et al., 19 Dec 2024).

1. Role of JUDGEJS in JavaScript Vulnerability Detection Workflows

JUDGEJS operates as the terminal evaluation stage in the SecJS pipeline. Upstream, ForgeJS mines code repositories (sourced from CVE and Mend.io), extracting vulnerable/fixed code pairs, generating dataset variants (including adversarial perturbations), and composing the ARENAJS benchmark of 1,437 JavaScript vulnerability cases. JUDGEJS processes each benchmark entry by querying candidate LLMs, collecting model-generated vulnerability reports, and rigorously matching these against ground-truth labels. The evaluation is performed at both project and function-level granularities, aggregating results per model, per dataset-variant, and per context (frontend, backend, full-stack).
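A minimal sketch of this per-entry flow, assuming a simple Finding record and placeholder query_llm / align callables (illustrative names, not the SecJS API):

from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    """One reported or labeled vulnerability location (illustrative schema)."""
    file: str
    function: str
    cwe_id: str

def evaluate_entry(project_files, ground_truth, query_llm, align):
    """Process one ARENAJS entry: query a model, then align its report with the labels.

    query_llm and align stand in for the Model Invocation and Label Alignment
    modules described in Section 2.
    """
    predictions = query_llm(project_files)           # list of Finding parsed from the model report
    matched = [p for p in predictions
               if any(align(p, g) for g in ground_truth)]
    predicted_vulnerable = bool(predictions)
    actually_vulnerable = bool(ground_truth)
    project_correct = (predicted_vulnerable == actually_vulnerable
                       and (not actually_vulnerable or bool(matched)))
    return {
        "project_correct": project_correct,          # project-level binary call with CWE match
        "function_matches": matched,                 # function-level localization hits
        "false_positives": [p for p in predictions if p not in matched],
    }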

2. Architectural Modules and Data Flow

The architecture of JUDGEJS consists of four core modules:

  • Model Invocation: Interfaces with commercial LLM APIs (OpenAI, Anthropic, Gemini) and constructs context-rich system prompts detailing all CWE types together with the entire project payload. Applies a confidence threshold (typically 0.8) to filter out model outputs below the reliability budget (see the sketch after this list).
  • Output Parsing & Normalization: Receives LLM outputs in JSON format, extracts standardized fields (file, line, severity, CWE-ID, description, exploit_scenario, recommendation), and normalizes textual fields (removing path prefixes, lowercasing).
  • Label Alignment & Matching: Matches predicted vulnerabilities to ground truth via CWE equivalence (grouping by CAPEC families), fuzzy file and function name normalization, and aggregated confusion-matrix logic.
  • Metrics Computation & Reporting: Calculates standard precision, recall, F₁, robustness curves under adversarial variants, and VD-S (Vulnerability Detection Score), subject to industrial false-positive constraints. Metrics are summarized and visualized for comparative analysis.
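A minimal sketch of the Output Parsing & Normalization and confidence-filtering steps, assuming the report arrives as a JSON array of finding objects carrying a confidence field (the field names and the parse_report helper are illustrative, not the actual JUDGEJS parser):

import json
import os

REQUIRED_FIELDS = ("file", "line", "severity", "cwe_id",
                   "description", "exploit_scenario", "recommendation")

def parse_report(raw_json, confidence_threshold=0.8):
    """Parse an LLM report, normalize paths and text, and drop low-confidence findings."""
    findings = []
    for item in json.loads(raw_json):                      # assumed: JSON array of findings
        if item.get("confidence", 0.0) < confidence_threshold:
            continue                                       # below the reliability budget
        normalized = {key: item.get(key) for key in REQUIRED_FIELDS}
        normalized["file"] = os.path.basename(str(normalized["file"]))   # strip path prefix
        if isinstance(normalized.get("cwe_id"), str):
            normalized["cwe_id"] = normalized["cwe_id"].strip().upper()
        for key in ("severity", "description"):            # lowercase free-text fields
            if isinstance(normalized.get(key), str):
                normalized[key] = normalized[key].strip().lower()
        findings.append(normalized)
    return findings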

3. Methodological Principles and Evaluation Metrics

JUDGEJS enforces three foundational evaluation principles:

  • Comprehensiveness: Supports 218 CWEs, assesses both function-level (localization) and project-level (global binary classification + CWE correctness), covers frontend, backend, and full-stack contexts, and evaluates across five dataset variants: Original, Noise, Obfuscated, Noise + Obfuscated, Prompt-Injection.
  • No Underestimation: Treats semantically equivalent vulnerabilities as matched (CWE equivalence classes), filters label noise with denoised splits, and maximizes LLM performance via multi-step security-review prompting.
  • No Overestimation: Operates at the full-repository level (not isolated snippets), evaluates on both vulnerable and fixed versions to quantify false positives, and uses adversarial augmentation to stress-test heuristics.

Metrics include precision ($P = \frac{TP}{TP+FP}$), recall ($R = \frac{TP}{TP+FN}$), F₁-score ($F_1 = \frac{2PR}{P+R}$), and the VD-S metric under a specified false-positive-rate constraint ($\mathrm{FPR} \le 0.5\%$):

$$\mathrm{VD\!-\!S}(r) = \mathrm{FNR} \mid \mathrm{FPR} \le r, \qquad \mathrm{FPR}=\frac{FP}{FP+TN}, \quad \mathrm{FNR}=\frac{FN}{TP+FN}$$

A further breakdown compares project-level and function-level performance across LLMs, as well as robustness under adversarial conditions.
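Concretely, VD-S can be computed by sweeping the model-confidence threshold and keeping the lowest FNR among operating points that respect the FPR budget. The sketch below assumes per-project confidence scores and binary vulnerable/fixed labels; the helper names are illustrative rather than the reference implementation:

def precision_recall_f1(tp, fp, fn):
    """Standard detection metrics; returns (precision, recall, F1)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def vd_s(scored_predictions, labels, fpr_budget=0.005):
    """VD-S(r): lowest FNR over confidence thresholds whose FPR stays within r.

    scored_predictions: iterable of (confidence, sample_id) pairs.
    labels: dict mapping sample_id -> True (vulnerable) / False (fixed).
    """
    best_fnr = 1.0                      # "flag nothing" always satisfies the budget
    for t in sorted({c for c, _ in scored_predictions}, reverse=True):
        flagged = {sid for c, sid in scored_predictions if c >= t}
        tp = fp = fn = tn = 0
        for sid, vulnerable in labels.items():
            predicted = sid in flagged
            if vulnerable and predicted:
                tp += 1
            elif vulnerable:
                fn += 1
            elif predicted:
                fp += 1
            else:
                tn += 1
        fpr = fp / (fp + tn) if fp + tn else 0.0
        fnr = fn / (tp + fn) if tp + fn else 1.0
        if fpr <= fpr_budget:
            best_fnr = min(best_fnr, fnr)
    return best_fnr

The "flag nothing" operating point trivially satisfies any FPR budget, which is why the sweep defaults to an FNR of 1.0 when no lower threshold fits the budget.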

4. Key Heuristics and Algorithms

JUDGEJS leverages multiple matching and alignment heuristics (a combined sketch follows this list):

  • CWE Equivalence: Two CWEs $c_1, c_2$ are equivalent ($c_1 \equiv c_2$) if they coincide or belong to the same MITRE CAPEC family: $c_1 = c_2 \,\vee\, \exists g \in \mathcal{G} : \{c_1, c_2\} \subseteq g$, where $\mathcal{G}$ denotes the set of CAPEC family groupings.
  • Fuzzy File/Function Matching: File paths match if their basenames are equal; function names are compared via normalized sets requiring nonempty intersection.
  • Diff-to-Function Mapping: Uses line-level diffs (via difflib.SequenceMatcher) between vulnerable and fixed versions to map ground-truth vulnerability locations to their enclosing functions for function-level evaluation.
  • Confidence-Threshold Tuning: Adjusts internal model confidence cutoffs to conform to FPR budget, ensuring reported metrics are robust to over-prediction.
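A combined sketch of these heuristics; the CAPEC_GROUPS table is a hand-written stand-in for the MITRE groupings, and the token-based function-name normalization is one plausible reading of the "normalized sets" rule:

import difflib
import os

# Hypothetical CAPEC-style groupings; the real equivalence classes come from MITRE data.
CAPEC_GROUPS = [{"CWE-79", "CWE-80"}, {"CWE-89", "CWE-564"}]

def cwe_equivalent(c1, c2):
    """c1 ≡ c2 if the CWEs coincide or share a CAPEC family."""
    return c1 == c2 or any({c1, c2} <= group for group in CAPEC_GROUPS)

def files_match(pred_path, true_path):
    """Fuzzy file match: basenames must be equal, ignoring directory prefixes."""
    return os.path.basename(pred_path) == os.path.basename(true_path)

def functions_match(pred_name, true_name):
    """Fuzzy function match: normalized token sets must share at least one token."""
    def norm(name):
        return {tok for tok in name.lower().replace(".", "_").split("_") if tok}
    return bool(norm(pred_name) & norm(true_name))

def changed_lines(vulnerable_src, fixed_src):
    """Line indices touched by the fix, used for diff-to-function mapping."""
    matcher = difflib.SequenceMatcher(None, vulnerable_src.splitlines(),
                                      fixed_src.splitlines())
    lines = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag != "equal":
            lines.extend(range(i1, i2))      # 0-based lines in the vulnerable version
    return lines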

5. Evaluation Findings and Limitations

Empirical evaluation of seven commercial LLMs (GPT-5, Claude-4.5, Gemini-Pro, etc.) on ARENAJS using JUDGEJS demonstrates:

| Model      | Project-Level F₁ (Original) | Function-Level F₁ (Original) | F₁ (Noise + Obfuscated) | VD-S (FPR ≤ 0.5%) |
|------------|-----------------------------|------------------------------|--------------------------|-------------------|
| GPT-5      | 32.1%                       | ~14–24% lower                | 7.8%                     | > 60–80%          |
| Claude-4.5 | 35.9%                       | ~8–18 pp lower               | 4.2%                     | > 60–80%          |

Models show substantially better project-level F₁ than function-level F₁, indicating reliance on superficial context (file structure, imports, comments) over true vulnerability reasoning and data flow. Robustness to adversarial perturbations is poor; F₁ drops precipitously under the noise and obfuscation variants. Under the industrial FPR constraint (≤ 0.5%), the majority of vulnerabilities are missed (VD-S scores exceed 60–80%), rendering these models impractical for deployment in sensitive settings.

JUDGEJS also highlights residual evaluation challenges: label noise remains even in denoised splits, CWE grouping is not perfect, and reliance on single-pass LLM outputs precludes beam search or human-in-the-loop correction. Contamination is possible if models have seen test projects during pretraining.

6. Researcher Workflow and Practical Usage

JUDGEJS is deployed via an accessible Python API, enabling batch evaluation and reporting:

from secjs.judge import JudgeJS
from secjs.data import ArenaJS   # assumed import path for the ARENAJS benchmark loader

# Evaluate one model on the Original variant of the ARENAJS benchmark.
dataset = ArenaJS.load(variant='Original', split='complete')
judge = JudgeJS(model='gpt-5', api_key='', confidence=0.8)
results = judge.evaluate(dataset)
judge.report(results, out='gpt5_original.csv')

Researchers repeat this workflow across variants and models, leveraging built-in visualization tools to track performance degradation under various adversarial scenarios.
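A hedged sketch of that batch loop, reusing the calls from the snippet above; the variant and model identifiers come from the text, while the ArenaJS import path and output naming are assumptions:

from secjs.judge import JudgeJS              # as in the snippet above
from secjs.data import ArenaJS               # assumed import path for the ARENAJS loader

VARIANTS = ["Original", "Noise", "Obfuscated", "Noise + Obfuscated", "Prompt-Injection"]
MODELS = ["gpt-5", "claude-4.5", "gemini-pro"]

def run_benchmark(api_keys):
    """Evaluate every model on every ARENAJS variant and write one CSV report per pair."""
    for model in MODELS:
        judge = JudgeJS(model=model, api_key=api_keys[model], confidence=0.8)
        for variant in VARIANTS:
            dataset = ArenaJS.load(variant=variant, split='complete')
            results = judge.evaluate(dataset)
            judge.report(results, out=f"{model}_{variant.lower().replace(' ', '')}.csv")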

7. Future Directions and Cross-Domain Extensions

JUDGEJS is explicitly positioned to benefit from integration with advanced reasoning pipelines such as the trichotomous dogmatics framework found in legal judgment prediction (Zhang et al., 19 Dec 2024). Incorporating multi-stage reasoning and balanced augmentation of “non-guilty” counterfactuals, as recommended for legal judgment models, may improve the ability of JUDGEJS to distinguish genuine vulnerabilities from benign code, particularly where current models exhibit strong “guilty” priors and lack awareness of exculpatory signals.

A plausible implication is that future iterations of JUDGEJS will require not only more robust data alignment and evaluation criteria but also explicit multi-factor reasoning and counterfactual modeling to reach reliability levels necessary for industrial and regulatory contexts.
