PythonSecurityEval Benchmark
- PythonSecurityEval is a comprehensive benchmarking suite that evaluates security-centric and general code quality properties in Python generated by LLMs.
- It employs automated tools like Bandit and Pylint for fine-grained static analysis and severity-weighted scoring to identify vulnerabilities and quality issues.
- The dataset supports feedback-driven security patching, enabling iterative refinement and detailed assessments of improvements in security, readability, and maintainability.
PythonSecurityEval is a comprehensive benchmarking suite for evaluating security-centric and general code-quality properties of Python code generated by LLMs. Designed to fill significant gaps left by conventional datasets focused primarily on functional correctness, PythonSecurityEval combines zero-shot, security-relevant natural-language prompts, fine-grained static-analysis labeling, and curated severity-weighted scoring mechanisms. The dataset is a central pillar for recent research on LLM-driven feedback, code refinement, and supply-chain security detection in the Python ecosystem (Blyth et al., 20 Aug 2025; Alrashedy et al., 2023; Ryan et al., 14 Dec 2025).
1. Dataset Construction and Coverage
PythonSecurityEval originated as a large-scale, security-focused benchmark distinct from prior datasets (such as HumanEval or MBPP). Its construction was motivated by the absence of real-world, zero-shot security-relevant tasks suitable for LLM evaluation and patching.
- Source and Sample Selection: Security-relevant prompts were mined from Stack Overflow, filtering for accepted answers applying Python modules in domains such as sqlite3, flask, subprocess, os, requests, pymongo, sqlalchemy, rsa, and hashlib. Each entry comprises a natural-language problem statement and a corresponding function signature. Prompts specifying "write secure code" were excluded to enforce an unconstrained, realistic evaluation scenario (Alrashedy et al., 2023).
- Statistics and Domains: The dataset contains 470 distinct prompt–function pairs (457 with paired unit tests). It systematically covers:
- System/OS-level: 66.6%
- Computation: 35.7%
- Network/HTTP/URLs: 31.3%
- Cryptography: 6.2%
- Database: 24.3%
- General-purpose: 88.1%
- Web frameworks: 9.1%
- Vulnerability Types: Automated Bandit analysis surfaces vulnerabilities corresponding to key CWEs, such as CWE-259 (hard-coded password), CWE-400 (uncontrolled resource consumption), CWE-78 (command injection), CWE-89 (SQL injection), CWE-22 (path traversal), and others (raw distributions given in Figure 1 of (Alrashedy et al., 2023)).
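To illustrate the kind of flaw Bandit surfaces, the sketch below contrasts a string-built SQL query (the CWE-89 pattern Bandit reports as test B608, hardcoded_sql_expressions) with the parameterized form that passes cleanly. The table name and helper functions are invented for illustration:

```python
import sqlite3

# Illustrative prompt-style task: the first query builds SQL by string
# formatting (the CWE-89 pattern Bandit flags as B608); the second uses
# a parameterized query, the standard mitigation.

def find_user_unsafe(conn, name):
    # Vulnerable: attacker-controlled 'name' is interpolated into the SQL text.
    return conn.execute(f"SELECT id FROM users WHERE name = '{name}'").fetchall()

def find_user_safe(conn, name):
    # Safe: 'name' is bound as a parameter, never parsed as SQL.
    return conn.execute("SELECT id FROM users WHERE name = ?", (name,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")
print(find_user_safe(conn, "alice"))  # [(1,)]
```

Both functions return the same rows for benign input; only the first is flagged, which is exactly the distinction the benchmark's severity labels capture.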
2. Labeling, Annotation, and Dimension Taxonomy
PythonSecurityEval emphasizes comprehensive multi-dimensional static analysis beyond mere correctness:
- Automated Static Analysis:
- Bandit is the primary security analysis tool, flagging vulnerabilities with severity and confidence scores mapped to CWEs. Every generated code snippet is passed through Bandit's >60 plugins.
- Pylint assesses the following code quality aspects:
- Convention (C): Readability (55 checks)
- Warning (W): Reliability (155 checks)
- Error (E): Correctness/Bugs (127 checks)
- Refactor (R): Maintainability (76 checks)
- Information (I): Metadata (9 checks)
- Table 1 (Blyth et al., 20 Aug 2025) maps these aspects directly to analysis tools.
- Labeling Protocol:
- All issues reported by Bandit and Pylint (across the full set of categories and severities) are retained as ground-truth labels; no manual relabeling at the issue level.
- Bandit flags each security issue with confidence and severity in {UNDEFINED, LOW, MEDIUM, HIGH}. Pylint categories are collected from out-of-the-box executions.
- Expert Consensus:
- Human experts only set category weights (for fitness scoring, see Section 4) and infer missing function signatures when omitted in prompts (these are recorded in the dataset metadata) (Blyth et al., 20 Aug 2025).
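The labeling protocol above can be sketched as a small normalization step from raw tool output into the dataset's record shapes. The Bandit field names (line_number, test_id, issue_severity, issue_confidence, issue_text) follow the JSON report of recent Bandit releases (`bandit -f json`) and should be verified against the installed version; the Pylint aspect map simply restates the category letters listed above, which prefix every Pylint message ID (e.g. C0114, W0611):

```python
# Map raw Bandit/Pylint findings onto the dataset's issue record shapes.

PYLINT_ASPECT = {
    "C": "Convention (readability)",
    "W": "Warning (reliability)",
    "E": "Error (correctness)",
    "R": "Refactor (maintainability)",
    "I": "Information (metadata)",
}

def to_bandit_issue(result: dict) -> dict:
    """Map one entry of Bandit's JSON 'results' onto bandit_issues."""
    return {
        "line": result["line_number"],
        "test": result["test_id"],
        "severity": result["issue_severity"],      # UNDEFINED/LOW/MEDIUM/HIGH
        "confidence": result["issue_confidence"],
        "description": result["issue_text"],
    }

def to_pylint_issue(line: int, msg_id: str, message: str) -> dict:
    """Map a Pylint message (e.g. W0611 unused-import) onto pylint_issues."""
    return {"line": line, "code": msg_id,
            "category": PYLINT_ASPECT[msg_id[0]], "message": message}

# Example entry resembling Bandit's output for a hard-coded password (CWE-259).
raw = {"line_number": 3, "test_id": "B105", "issue_severity": "LOW",
       "issue_confidence": "MEDIUM",
       "issue_text": "Possible hardcoded password: 'hunter2'"}
print(to_bandit_issue(raw)["test"])                               # B105
print(to_pylint_issue(1, "W0611", "Unused import os")["category"])
```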
3. Schema, Format, and Access
Data is delivered as a JSON Lines file (python_security_eval.jsonl) using a transparent, extensible schema:
| Field | Type | Description |
|---|---|---|
| prompt_id | string | Unique identifier |
| nl_prompt | string | Natural-language problem statement |
| test_suite | array | Pairs of {input, expected_output} for unit testing |
| model | string | E.g., "gpt-4o" |
| issuesSelected | int or string | Issues to select per iteration |
| iteration | int | Refinement round (0 = initial) |
| code | string | Generated Python code sample |
| passed_all_tests | bool | Functional correctness indicator |
| fitness_score | float or –∞ | Weighted score (see below) |
| total_severity | int | Weighted sum δ(S) of issue severities |
| bandit_issues | array | Each: {line, test, severity, confidence, description} |
| pylint_issues | array | Each: {line, code, category, message} |
| codeql_issues | array (optional) | Supplemental security findings |
All records can be loaded for analysis with standard Python tooling:
```python
import json

records = [json.loads(line) for line in open("python_security_eval.jsonl")]
```
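Once loaded, records can be sliced along the schema's iteration and issue fields. The sketch below computes, per refinement round, the fraction of snippets carrying at least one Bandit finding; the two inline records are synthetic stand-ins for entries loaded from python_security_eval.jsonl:

```python
from collections import defaultdict

# Synthetic stand-ins for records loaded from python_security_eval.jsonl.
records = [
    {"prompt_id": "p1", "iteration": 0, "bandit_issues": [{"test": "B608"}]},
    {"prompt_id": "p1", "iteration": 1, "bandit_issues": []},
]

flagged = defaultdict(lambda: [0, 0])  # iteration -> [flagged, total]
for r in records:
    flagged[r["iteration"]][0] += bool(r["bandit_issues"])
    flagged[r["iteration"]][1] += 1

for it, (hit, total) in sorted(flagged.items()):
    print(f"iteration {it}: {hit}/{total} snippets flagged")
```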
4. Scoring Metrics and Evaluation Protocols
PythonSecurityEval introduces a severity-weighted scoring framework to quantitatively assess LLM outputs beyond pass/fail correctness:
- Severity-Weighted Sum: each snippet S receives a total severity δ(S) = Σ_{i ∈ Issues(S)} w(i), where the weights w(i) are: Security HIGH: 30, MEDIUM: 20, LOW/UNDEFINED: 10; each Pylint Convention/Error/Warning/Refactor issue: 3.
- Fitness Score: fitness(S) = −δ(S) if S passes all unit tests, and −∞ otherwise (matching the fitness_score field in the schema above).
- Vulnerability Rate: VR = (1/N) Σ_S 1[B(S) ≠ ∅], where B(S) denotes Bandit's report on snippet S and N is the number of snippets.
- Relative Improvement: RI = (VR_before − VR_after) / VR_before, which promotes transparent comparison of patching/repair systems (Alrashedy et al., 2023).
- Refinement Protocols:
- For each prompt, LLMs generate an initial code sample (iteration 0) and, if desired, iteratively refine it using static-analysis feedback (static analysis as a feedback loop, or FDSP; see Section 6).
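The scoring rules above can be stated directly in code. The Bandit severity weights and the flat Pylint weight are those listed in the text; setting fitness to −∞ on any test failure follows the schema's fitness_score field (float or −∞) and is otherwise an assumption of this sketch:

```python
import math

# Severity-weighted scoring: Bandit weights per the benchmark's table,
# flat weight 3 for every Pylint C/E/W/R issue.
BANDIT_WEIGHT = {"HIGH": 30, "MEDIUM": 20, "LOW": 10, "UNDEFINED": 10}
PYLINT_WEIGHT = 3

def total_severity(bandit_issues, pylint_issues):
    """delta(S): weighted sum over all flagged issues of snippet S."""
    return (sum(BANDIT_WEIGHT[i["severity"]] for i in bandit_issues)
            + PYLINT_WEIGHT * len(pylint_issues))

def fitness(bandit_issues, pylint_issues, passed_all_tests):
    """Negated severity when tests pass; -inf on any test failure."""
    if not passed_all_tests:
        return -math.inf
    return -total_severity(bandit_issues, pylint_issues)

print(fitness([{"severity": "HIGH"}], [{"category": "C"}], True))  # -33
print(fitness([], [], False))                                      # -inf
```

A snippet with one HIGH Bandit finding and one Pylint convention issue scores −(30 + 3) = −33; any functional failure dominates all static findings.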
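The refinement protocol amounts to a generate–analyze–repair loop. The sketch below shows its shape only: analyze() and refine() are hypothetical stand-ins for real Bandit/Pylint runs and an LLM refinement call, reduced here to a toy rule for the CWE-78 shell=True pattern:

```python
# Hedged sketch of the static-analysis feedback loop (FDSP-style):
# generate a snippet, analyze it, feed findings back, repeat.

def analyze(code: str) -> list:
    # Toy analyzer: flags shell=True as a command-injection risk (CWE-78).
    return ["B602: subprocess call with shell=True"] if "shell=True" in code else []

def refine(code: str, findings: list) -> str:
    # Toy repair: drop the risky flag; a real loop would re-prompt the LLM
    # with the findings attached to the original prompt.
    return code.replace("shell=True", "shell=False")

def feedback_loop(code: str, max_iterations: int = 10) -> str:
    for _ in range(max_iterations):
        findings = analyze(code)
        if not findings:
            break
        code = refine(code, findings)
    return code

print(analyze(feedback_loop("subprocess.run(cmd, shell=True)")))  # []
```

The loop terminates either when the analyzers report no findings or after a fixed iteration budget (10 in the experiments summarized below).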
5. Empirical Insights and Quality Distributions
Comprehensive statistics highlight the initial vulnerability of LLM outputs (as generated by GPT-4o) and the impact of static analysis-driven refinement:
| Category | Initial (%) | After 10 Iterations (%) | Δ (percentage points) |
|---|---|---|---|
| Security | >40 | 13.4 | –26.6 |
| Readability (C) | >80 | 18.2 | –61.8 |
| Reliability (W) | >50 | 12.8 | –37.2 |
| Maintainability | ~15 | 2.2 | –12.8 |
| Errors (E) | ~20 | 15.6 | –4.4 |
- Functional correctness (passing all tests) increases by approximately 10 percentage points across issue selection strategies (Table 3, (Blyth et al., 20 Aug 2025)).
- The fraction of code snippets exhibiting at least one security flaw drops from >40% to 13.4%, and readability violations from >80% to 18.2%, demonstrating systematic uplift in multi-dimensional code quality.
- Empirically, LLMs guided by this feedback can repair a majority of security, readability, and reliability flaws while also improving maintainability and test pass rates.
6. Use Cases and Integration with Feedback-Driven Security Patching
PythonSecurityEval supports reproducible zero-shot benchmarking, automated security patching evaluation, and static analysis research:
- Integration with FDSP:
As the canonical zero-shot benchmark, PythonSecurityEval underpins Feedback-Driven Security Patching (FDSP), where LLMs generate, analyze, and iteratively repair vulnerable code. FDSP, as described in (Alrashedy et al., 2023), outperforms self-feedback methods by up to 17.6% relative reduction in vulnerability rate, measured directly using this dataset.
- General Research Applications:
- Drop-in static analysis benchmark for LLM outputs
- Fine-grained ablations on refinement rounds, issue-selection policies, and prompt configurations
- Automated regression testing for security-mitigating strategies
- Extensible schema enabling integration with dynamic analysis datasets and sequential pattern mining approaches
7. Relations and Distinctions to Related Benchmarks
PythonSecurityEval complements, but is distinct from, several adjacent benchmarks:
- Statement-level malicious logic mining (Ryan et al., 14 Dec 2025):
Whereas PythonSecurityEval focuses on LLM code generation and static vulnerability assessment, (Ryan et al., 14 Dec 2025) targets fine-grained annotation of real-world malicious code within Python packages, using a statement-level taxonomy (47 malicious indicators, 7 behavioral types).
- Dynamic analysis for supply chain attacks (Mehedi et al., 20 May 2025):
QUT-DV25 extends the static code and prompt approach of PythonSecurityEval by capturing eBPF-based dynamic traces from 14,271 Python packages, facilitating detection of multi-phase malware and covert network activity.
- Security commit mining (Sun et al., 2023):
PySecDB catalogues real-world security-related code commits, using graph neural architectures to identify security fix patterns and augmenting the space of security patches.
A plausible implication is that PythonSecurityEval fills a critical role in bridging LLM-centric code generation and repair with both static and dynamic threat detection workflows for Python.
References:
(Blyth et al., 20 Aug 2025, Alrashedy et al., 2023, Ryan et al., 14 Dec 2025, Mehedi et al., 20 May 2025, Sun et al., 2023)