PythonSecurityEval Benchmark
- PythonSecurityEval is a comprehensive benchmarking suite that evaluates security-centric and general code quality properties in Python generated by LLMs.
- It employs automated tools like Bandit and Pylint for fine-grained static analysis and severity-weighted scoring to identify vulnerabilities and quality issues.
- The dataset supports feedback-driven security patching, enabling iterative refinement and detailed assessments of improvements in security, readability, and maintainability.
PythonSecurityEval is a comprehensive benchmarking suite for evaluating security-centric and general code-quality properties of Python code generated by LLMs. Designed to fill significant gaps left by conventional datasets focused primarily on functional correctness, PythonSecurityEval combines zero-shot, security-relevant natural-language prompts, fine-grained static-analysis labeling, and curated severity-weighted scoring mechanisms. The dataset is a central pillar for recent research on LLM-driven feedback, code refinement, and supply-chain security detection in the Python ecosystem (Blyth et al., 20 Aug 2025; Alrashedy et al., 2023; Ryan et al., 14 Dec 2025).
1. Dataset Construction and Coverage
PythonSecurityEval originated as a large-scale, security-focused benchmark distinct from prior datasets (such as HumanEval or MBPP). Its construction was motivated by the absence of real-world, zero-shot security-relevant tasks suitable for LLM evaluation and patching.
- Source and Sample Selection: Security-relevant prompts were mined from Stack Overflow, filtering for accepted answers applying Python modules in domains such as sqlite3, flask, subprocess, os, requests, pymongo, sqlalchemy, rsa, and hashlib. Each entry comprises a natural-language problem statement and a corresponding function signature. Prompts specifying "write secure code" were excluded to enforce an unconstrained, realistic evaluation scenario (Alrashedy et al., 2023).
- Statistics and Domains: The dataset contains 470 distinct prompt–function pairs (457 with paired unit tests). It systematically covers:
- System/OS-level: 66.6%
- Computation: 35.7%
- Network/HTTP/URLs: 31.3%
- Cryptography: 6.2%
- Database: 24.3%
- General-purpose: 88.1%
- Web frameworks: 9.1%
- Vulnerability Types: Automated Bandit analysis surfaces vulnerabilities corresponding to key CWEs, such as CWE-259 (hard-coded password), CWE-400 (uncontrolled resource consumption), CWE-78 (command injection), CWE-89 (SQL injection), CWE-22 (path traversal), and others (raw distributions given in Figure 1 of (Alrashedy et al., 2023)).
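To illustrate the kind of flaw Bandit surfaces, the sketch below contrasts a string-built SQL query (the CWE-89 pattern Bandit reports as test B608, hardcoded_sql_expressions) with the parameterized form that passes cleanly. The table name and helper functions are invented for illustration:

```python
import sqlite3

# Illustrative prompt-style task: the first query builds SQL by string
# formatting (the CWE-89 pattern Bandit flags as B608); the second uses
# a parameterized query, the standard mitigation.

def find_user_unsafe(conn, name):
    # Vulnerable: attacker-controlled 'name' is interpolated into the SQL text.
    return conn.execute(f"SELECT id FROM users WHERE name = '{name}'").fetchall()

def find_user_safe(conn, name):
    # Safe: 'name' is bound as a parameter, never parsed as SQL.
    return conn.execute("SELECT id FROM users WHERE name = ?", (name,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")
print(find_user_safe(conn, "alice"))  # [(1,)]
```

Both functions return the same rows for benign input; only the first is flagged, which is exactly the distinction the benchmark's severity labels capture.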
2. Labeling, Annotation, and Dimension Taxonomy
PythonSecurityEval emphasizes comprehensive multi-dimensional static analysis beyond mere correctness:
- Automated Static Analysis:
- Bandit is the primary security analysis tool, flagging vulnerabilities with severity and confidence scores mapped to CWEs. Every generated code snippet is passed through Bandit's >60 plugins.
- Pylint assesses the following code quality aspects:
- Convention (C): Readability (55 checks)
- Warning (W): Reliability (155 checks)
- Error (E): Correctness/Bugs (127 checks)
- Refactor (R): Maintainability (76 checks)
- Information (I): Metadata (9 checks)
- Table 1 (Blyth et al., 20 Aug 2025) maps these aspects directly to analysis tools.
- Labeling Protocol:
- All issues reported by Bandit and Pylint (across the full set of categories and severities) are retained as ground-truth labels; no manual relabeling at the issue level.
- Bandit flags each security issue with confidence and severity in {UNDEFINED, LOW, MEDIUM, HIGH}. Pylint categories are collected from out-of-the-box executions.
- Expert Consensus:
- Human experts only set category weights (for fitness scoring, see Section 4) and infer missing function signatures when omitted in prompts (these are recorded in the dataset metadata) (Blyth et al., 20 Aug 2025).
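The labeling protocol above can be sketched as a small normalization step from raw tool output into the dataset's record shapes. The Bandit field names (line_number, test_id, issue_severity, issue_confidence, issue_text) follow the JSON report of recent Bandit releases (`bandit -f json`) and should be verified against the installed version; the Pylint aspect map simply restates the category letters listed above, which prefix every Pylint message ID (e.g. C0114, W0611):

```python
# Map raw Bandit/Pylint findings onto the dataset's issue record shapes.

PYLINT_ASPECT = {
    "C": "Convention (readability)",
    "W": "Warning (reliability)",
    "E": "Error (correctness)",
    "R": "Refactor (maintainability)",
    "I": "Information (metadata)",
}

def to_bandit_issue(result: dict) -> dict:
    """Map one entry of Bandit's JSON 'results' onto bandit_issues."""
    return {
        "line": result["line_number"],
        "test": result["test_id"],
        "severity": result["issue_severity"],      # UNDEFINED/LOW/MEDIUM/HIGH
        "confidence": result["issue_confidence"],
        "description": result["issue_text"],
    }

def to_pylint_issue(line: int, msg_id: str, message: str) -> dict:
    """Map a Pylint message (e.g. W0611 unused-import) onto pylint_issues."""
    return {"line": line, "code": msg_id,
            "category": PYLINT_ASPECT[msg_id[0]], "message": message}

# Example entry resembling Bandit's output for a hard-coded password (CWE-259).
raw = {"line_number": 3, "test_id": "B105", "issue_severity": "LOW",
       "issue_confidence": "MEDIUM",
       "issue_text": "Possible hardcoded password: 'hunter2'"}
print(to_bandit_issue(raw)["test"])                               # B105
print(to_pylint_issue(1, "W0611", "Unused import os")["category"])
```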
3. Schema, Format, and Access
Data is delivered as a JSON Lines file (python_security_eval.jsonl) using a transparent, extensible schema:
| Field | Type | Description |
|---|---|---|
| prompt_id | string | Unique identifier |
| nl_prompt | string | Natural-language problem statement |
| test_suite | array | Pairs of {input, expected_output} for unit testing |
| model | string | E.g., "gpt-4o" |
| issuesSelected | int or string | Issues to select per iteration |
| iteration | int | Refinement round (0 = initial) |
| code | string | Generated Python code sample |
| passed_all_tests | bool | Functional correctness indicator |
| fitness_score | float or –∞ | Weighted score (see below) |
| total_severity | int | Weighted sum δ(S) of issue severities |
| bandit_issues | array | Each: {line, test, severity, confidence, description} |
| pylint_issues | array | Each: {line, code, category, message} |
| codeql_issues | array (optional) | Supplemental security findings |
All records can be loaded for analysis with standard Python tooling:
```python
import json

records = [json.loads(line) for line in open("python_security_eval.jsonl")]
```
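Once loaded, records can be sliced along the schema's iteration and issue fields. The sketch below computes, per refinement round, the fraction of snippets carrying at least one Bandit finding; the two inline records are synthetic stand-ins for entries loaded from python_security_eval.jsonl:

```python
from collections import defaultdict

# Synthetic stand-ins for records loaded from python_security_eval.jsonl.
records = [
    {"prompt_id": "p1", "iteration": 0, "bandit_issues": [{"test": "B608"}]},
    {"prompt_id": "p1", "iteration": 1, "bandit_issues": []},
]

flagged = defaultdict(lambda: [0, 0])  # iteration -> [flagged, total]
for r in records:
    flagged[r["iteration"]][0] += bool(r["bandit_issues"])
    flagged[r["iteration"]][1] += 1

for it, (hit, total) in sorted(flagged.items()):
    print(f"iteration {it}: {hit}/{total} snippets flagged")
```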
4. Scoring Metrics and Evaluation Protocols
PythonSecurityEval introduces a severity-weighted scoring framework to quantitatively assess LLM outputs beyond pass/fail correctness:
- Severity-Weighted Sum: each snippet S receives a total severity δ(S) = Σ_{i ∈ Issues(S)} w(i), where the weights w(i) are: Security HIGH: 30, MEDIUM: 20, LOW/UNDEFINED: 10; each Pylint Convention/Error/Warning/Refactor issue: 3.
- Fitness Score: fitness(S) = −δ(S) if S passes all unit tests, and −∞ otherwise (matching the fitness_score field in the schema above).
- Vulnerability Rate: VR = (1/N) Σ_S 1[B(S) ≠ ∅], where B(S) denotes Bandit's report on snippet S and N is the number of snippets.
- Relative Improvement: RI = (VR_before − VR_after) / VR_before, which promotes transparent comparison of patching/repair systems (Alrashedy et al., 2023).
- Refinement Protocols:
- For each prompt, LLMs generate an initial code sample (iteration 0) and, if desired, iteratively refine it using static-analysis feedback (static analysis as a feedback loop, or FDSP; see Section 6).
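The scoring rules above can be stated directly in code. The Bandit severity weights and the flat Pylint weight are those listed in the text; setting fitness to −∞ on any test failure follows the schema's fitness_score field (float or −∞) and is otherwise an assumption of this sketch:

```python
import math

# Severity-weighted scoring: Bandit weights per the benchmark's table,
# flat weight 3 for every Pylint C/E/W/R issue.
BANDIT_WEIGHT = {"HIGH": 30, "MEDIUM": 20, "LOW": 10, "UNDEFINED": 10}
PYLINT_WEIGHT = 3

def total_severity(bandit_issues, pylint_issues):
    """delta(S): weighted sum over all flagged issues of snippet S."""
    return (sum(BANDIT_WEIGHT[i["severity"]] for i in bandit_issues)
            + PYLINT_WEIGHT * len(pylint_issues))

def fitness(bandit_issues, pylint_issues, passed_all_tests):
    """Negated severity when tests pass; -inf on any test failure."""
    if not passed_all_tests:
        return -math.inf
    return -total_severity(bandit_issues, pylint_issues)

print(fitness([{"severity": "HIGH"}], [{"category": "C"}], True))  # -33
print(fitness([], [], False))                                      # -inf
```

A snippet with one HIGH Bandit finding and one Pylint convention issue scores −(30 + 3) = −33; any functional failure dominates all static findings.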
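The refinement protocol amounts to a generate–analyze–repair loop. The sketch below shows its shape only: analyze() and refine() are hypothetical stand-ins for real Bandit/Pylint runs and an LLM refinement call, reduced here to a toy rule for the CWE-78 shell=True pattern:

```python
# Hedged sketch of the static-analysis feedback loop (FDSP-style):
# generate a snippet, analyze it, feed findings back, repeat.

def analyze(code: str) -> list:
    # Toy analyzer: flags shell=True as a command-injection risk (CWE-78).
    return ["B602: subprocess call with shell=True"] if "shell=True" in code else []

def refine(code: str, findings: list) -> str:
    # Toy repair: drop the risky flag; a real loop would re-prompt the LLM
    # with the findings attached to the original prompt.
    return code.replace("shell=True", "shell=False")

def feedback_loop(code: str, max_iterations: int = 10) -> str:
    for _ in range(max_iterations):
        findings = analyze(code)
        if not findings:
            break
        code = refine(code, findings)
    return code

print(analyze(feedback_loop("subprocess.run(cmd, shell=True)")))  # []
```

The loop terminates either when the analyzers report no findings or after a fixed iteration budget (10 in the experiments summarized below).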
5. Empirical Insights and Quality Distributions
Comprehensive statistics highlight the initial vulnerability of LLM outputs (as generated by GPT-4o) and the impact of static analysis-driven refinement:
| Category | Initial (%) | After 10 Iterations (%) | Δ (percentage points) |
|---|---|---|---|
| Security | >40 | 13.4 | –26.6 |
| Readability (C) | >80 | 18.2 | –61.8 |
| Reliability (W) | >50 | 12.8 | –37.2 |
| Maintainability | ~15 | 2.2 | –12.8 |
| Errors (E) | ~20 | 15.6 | –4.4 |
- Functional correctness (passing all tests) increases by approximately 10 percentage points across issue selection strategies (Table 3, (Blyth et al., 20 Aug 2025)).
- The fraction of code snippets exhibiting at least one security flaw drops from >40% to 13.4%, and readability violations from >80% to 18.2%, demonstrating systematic uplift in multi-dimensional code quality.
- Empirically, LLMs guided by this feedback can repair a majority of security, readability, and reliability flaws while also improving maintainability and test pass rates.
6. Use Cases and Integration with Feedback-Driven Security Patching
PythonSecurityEval supports reproducible zero-shot benchmarking, automated security patching evaluation, and static analysis research:
- Integration with FDSP:
As the canonical zero-shot benchmark, PythonSecurityEval underpins Feedback-Driven Security Patching (FDSP), where LLMs generate, analyze, and iteratively repair vulnerable code. FDSP, as described in (Alrashedy et al., 2023), outperforms self-feedback methods by up to 17.6% relative reduction in vulnerability rate, measured directly using this dataset.
- General Research Applications:
- Drop-in static analysis benchmark for LLM outputs
- Fine-grained ablations on refinement rounds, issue-selection policies, and prompt configurations
- Automated regression testing for security-mitigating strategies
- Extensible schema enabling integration with dynamic analysis datasets and sequential pattern mining approaches
7. Relations and Distinctions to Related Benchmarks
PythonSecurityEval complements, but is distinct from, several adjacent benchmarks:
- Statement-level malicious logic mining (Ryan et al., 14 Dec 2025):
Whereas PythonSecurityEval focuses on LLM code generation and static vulnerability assessment, (Ryan et al., 14 Dec 2025) targets fine-grained annotation of real-world malicious code within Python packages, using a statement-level taxonomy (47 malicious indicators, 7 behavioral types).
- Dynamic analysis for supply chain attacks (Mehedi et al., 20 May 2025):
QUT-DV25 extends the static code and prompt approach of PythonSecurityEval by capturing eBPF-based dynamic traces from 14,271 Python packages, facilitating detection of multi-phase malware and covert network activity.
- Security commit mining (Sun et al., 2023):
PySecDB catalogues real-world security-related code commits, using graph neural architectures to identify security fix patterns and augmenting the space of security patches.
A plausible implication is that PythonSecurityEval fills a critical role in bridging LLM-centric code generation and repair with both static and dynamic threat detection workflows for Python.
References:
(Blyth et al., 20 Aug 2025, Alrashedy et al., 2023, Ryan et al., 14 Dec 2025, Mehedi et al., 20 May 2025, Sun et al., 2023)