
PythonSecurityEval Benchmark

Updated 29 January 2026
  • PythonSecurityEval is a comprehensive benchmarking suite that evaluates security-centric and general code quality properties in Python generated by LLMs.
  • It employs automated tools like Bandit and Pylint for fine-grained static analysis and severity-weighted scoring to identify vulnerabilities and quality issues.
  • The dataset supports feedback-driven security patching, enabling iterative refinement and detailed assessments of improvements in security, readability, and maintainability.

PythonSecurityEval is a comprehensive benchmarking suite for the evaluation of security-centric and general code quality properties in Python generated by LLMs. Designed to fill significant gaps left by conventional datasets focused primarily on functional correctness, PythonSecurityEval combines zero-shot, security-relevant natural language prompts, fine-grained static analysis labeling, and curated severity-weighted scoring mechanisms. The dataset is a central pillar for recent research on LLM-driven feedback, code refinement, and supply chain security detection in the Python ecosystem (Blyth et al., 20 Aug 2025; Alrashedy et al., 2023; Ryan et al., 14 Dec 2025).

1. Dataset Construction and Coverage

PythonSecurityEval originated as a large-scale, security-focused benchmark distinct from prior datasets (such as HumanEval or MBPP). Its construction was motivated by the absence of real-world, zero-shot security-relevant tasks suitable for LLM evaluation and patching.

  • Source and Sample Selection: Security-relevant prompts were mined from Stack Overflow, filtering for accepted answers applying Python modules in domains such as sqlite3, flask, subprocess, os, requests, pymongo, sqlalchemy, rsa, and hashlib. Each entry comprises a natural-language problem statement and corresponding function signature. Prompts specifying "write secure code" were excluded to enforce an unconstrained, realistic evaluation scenario (Alrashedy et al., 2023).
  • Statistics and Domains: The dataset contains 470 distinct prompt–function pairs (457 with paired unit tests). It systematically covers the following domains (a prompt may belong to several, so percentages sum to more than 100%):
    • System/OS-level: 66.6%
    • Computation: 35.7%
    • Network/HTTP/URLs: 31.3%
    • Cryptography: 6.2%
    • Database: 24.3%
    • General-purpose: 88.1%
    • Web frameworks: 9.1%
  • Vulnerability Types: Automated Bandit analysis surfaces vulnerabilities corresponding to key CWEs, such as CWE-259 (hard-coded password), CWE-400 (uncontrolled resource consumption), CWE-78 (command injection), CWE-89 (SQL injection), CWE-22 (path traversal), and others (raw distributions given in Figure 1 of (Alrashedy et al., 2023)).
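
To make the CWE categories above concrete, the following minimal sketch contrasts a CWE-89 (SQL injection) pattern with its standard fix using the stdlib sqlite3 module; the function names and payload are illustrative, not taken from the dataset:

```python
import sqlite3

def find_user_vulnerable(conn, username):
    # CWE-89: string interpolation lets crafted input rewrite the query
    return conn.execute(
        f"SELECT id FROM users WHERE name = '{username}'"
    ).fetchall()

def find_user_patched(conn, username):
    # Parameterized query: the driver binds the value, defeating injection
    return conn.execute(
        "SELECT id FROM users WHERE name = ?", (username,)
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")

payload = "x' OR '1'='1"                         # classic injection payload
print(len(find_user_vulnerable(conn, payload)))  # 2 (every row leaks)
print(len(find_user_patched(conn, payload)))     # 0 (no literal match)
```

Bandit flags the interpolated variant (its B608 "hardcoded SQL expressions" family), while the parameterized form passes cleanly; this is exactly the kind of issue the dataset's labels capture.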

2. Labeling, Annotation, and Dimension Taxonomy

PythonSecurityEval emphasizes comprehensive multi-dimensional static analysis beyond mere correctness:

  • Automated Static Analysis:
    • Bandit is the primary security analysis tool, flagging vulnerabilities with severity and confidence scores mapped to CWEs. Every generated code snippet is passed through Bandit's >60 plugins.
    • Pylint assesses the following code quality aspects:
      • Convention (C): Readability (55 checks)
      • Warning (W): Reliability (155 checks)
      • Error (E): Correctness/Bugs (127 checks)
      • Refactor (R): Maintainability (76 checks)
      • Information (I): Metadata (9 checks)
    • Table 1 (Blyth et al., 20 Aug 2025) maps these aspects directly to analysis tools.
  • Labeling Protocol:
    • All issues reported by Bandit and Pylint (across the full set of categories and severities) are retained as ground-truth labels; no manual relabeling at the issue level.
    • Bandit flags each security issue with confidence and severity in {UNDEFINED, LOW, MEDIUM, HIGH}. Pylint categories are collected from out-of-the-box executions.
  • Expert Consensus:
    • Human experts only set category weights (for fitness scoring, see Section 4) and infer missing function signatures when omitted in prompts (these are recorded in the dataset metadata) (Blyth et al., 20 Aug 2025).
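
The labeling protocol above retains Bandit's findings verbatim as ground truth. A minimal sketch of normalizing Bandit's JSON report into the dataset's bandit_issues records follows; the Bandit field names (`line_number`, `test_id`, `issue_severity`, `issue_confidence`, `issue_text`) follow its `-f json` output, but treat them as an assumption to verify against your Bandit version:

```python
# Sketch: flatten a Bandit `-f json` report into the dataset's
# bandit_issues schema ({line, test, severity, confidence, description}).
def to_bandit_issues(report):
    return [
        {
            "line": r["line_number"],
            "test": r["test_id"],
            "severity": r["issue_severity"],      # UNDEFINED/LOW/MEDIUM/HIGH
            "confidence": r["issue_confidence"],
            "description": r["issue_text"],
        }
        for r in report.get("results", [])
    ]

# Hand-built sample report for illustration.
sample = {"results": [{
    "line_number": 3, "test_id": "B602",
    "issue_severity": "HIGH", "issue_confidence": "HIGH",
    "issue_text": "subprocess call with shell=True identified",
}]}
print(to_bandit_issues(sample)[0]["severity"])  # HIGH
```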

3. Schema, Format, and Access

Data is delivered as a JSON Lines file (python_security_eval.jsonl) using a transparent, extensible schema:

Field            | Type             | Description
-----------------|------------------|------------------------------------------------------
prompt_id        | string           | Unique identifier
nl_prompt        | string           | Natural-language problem statement
test_suite       | array            | Pairs of {input, expected_output} for unit testing
model            | string           | E.g., "gpt-4o"
issuesSelected   | int or string    | Issues to select per iteration
iteration        | int              | Refinement round (0 = initial)
code             | string           | Generated Python code sample
passed_all_tests | bool             | Functional correctness indicator
fitness_score    | float or –∞      | Weighted score (see below)
total_severity   | int              | Weighted sum δ(S) of issue severities
bandit_issues    | array            | Each: {line, test, severity, confidence, description}
pylint_issues    | array            | Each: {line, code, category, message}
codeql_issues    | array (optional) | Supplemental security findings

All records can be loaded for analysis with standard Python tooling:

import json

with open("python_security_eval.jsonl") as f:
    records = [json.loads(line) for line in f]

The repository is available at https://github.com/Kamel773/LLM-code-refine under the MIT license (Blyth et al., 20 Aug 2025).

4. Scoring Metrics and Evaluation Protocols

PythonSecurityEval introduces a severity-weighted scoring framework to quantitatively assess LLM outputs beyond pass/fail correctness:

  • Severity-Weighted Sum:
    • Security HIGH: 30, MEDIUM: 20, LOW/UNDEFINED: 10
    • Convention/Error/Warning/Refactor: 3
  • Fitness Score:

f(S) = \begin{cases} -\delta(S) & \text{if } S \text{ passes all tests} \\ -\infty & \text{otherwise} \end{cases}
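
The weights and fitness function above can be sketched directly; the issue-dict shapes mirror the schema in Section 3, and the weight constants are taken from the severity-weighted scheme just listed:

```python
import math

# Severity weights from the benchmark's scoring scheme.
SECURITY_WEIGHTS = {"HIGH": 30, "MEDIUM": 20, "LOW": 10, "UNDEFINED": 10}
PYLINT_WEIGHT = 3  # Convention/Error/Warning/Refactor issues

def delta(bandit_issues, pylint_issues):
    # delta(S): severity-weighted sum over all flagged issues
    return (sum(SECURITY_WEIGHTS[i["severity"]] for i in bandit_issues)
            + PYLINT_WEIGHT * len(pylint_issues))

def fitness(bandit_issues, pylint_issues, passed_all_tests):
    # f(S) = -delta(S) if all tests pass, else -infinity
    if not passed_all_tests:
        return -math.inf
    return -delta(bandit_issues, pylint_issues)

print(fitness([{"severity": "HIGH"}], [{}, {}], True))  # -(30 + 2*3) = -36
print(fitness([], [], False))                           # -inf
```

The -∞ branch makes functional correctness a hard constraint: no amount of static-analysis cleanliness can compensate for a failing test suite.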

  • Vulnerability Rate:

VR = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\delta(y_i) \neq \emptyset]

where \delta(y_i) denotes Bandit's issue report on snippet i.

  • Relative Improvement:

R_{\text{FDSP}} = \frac{VR_{\text{init}} - VR_{\text{FDSP}}}{VR_{\text{init}}} \times 100\%

This metric promotes transparent comparison of patching/repair systems (Alrashedy et al., 2023).
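
Both metrics reduce to a few lines; the per-snippet issue lists below are synthetic examples, not dataset values:

```python
def vulnerability_rate(reports):
    # reports: per-snippet Bandit issue lists; VR = fraction with any issue
    return sum(1 for r in reports if r) / len(reports)

def relative_improvement(vr_init, vr_fdsp):
    # R_FDSP: percent reduction in vulnerability rate after patching
    return (vr_init - vr_fdsp) / vr_init * 100

init = [["B602"], [], ["B105"], ["B608"], []]    # 3 of 5 snippets flagged
after = [[], [], ["B105"], [], []]               # 1 of 5 still flagged
print(vulnerability_rate(init))                  # 0.6
print(vulnerability_rate(after))                 # 0.2
print(round(relative_improvement(0.6, 0.2), 1))  # 66.7
```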

  • Refinement Protocols:
    • For each prompt, LLMs generate an initial code sample $y_0$ and, if desired, iteratively refine it using static analysis feedback (static analysis as a feedback loop, or FDSP; see Section 6).
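
The refinement protocol amounts to a generate–analyze–repair loop. A minimal sketch follows, with `generate` and `analyze` as stand-ins for an LLM call and a Bandit run (the toy implementations here simply fix one flagged issue per round):

```python
# FDSP-style feedback loop: generate, analyze, repair until clean
# or the iteration budget is exhausted.
def refine(prompt, generate, analyze, max_iters=10):
    code = generate(prompt, feedback=None)       # y_0: initial sample
    for _ in range(max_iters):
        issues = analyze(code)
        if not issues:                           # no remaining issues: stop
            break
        code = generate(prompt, feedback=issues) # repair using the report
    return code

# Toy stand-ins for illustration: each repair round clears one issue.
state = {"issues": ["B602", "B105"]}
def analyze(code): return list(state["issues"])
def generate(prompt, feedback):
    if feedback:
        state["issues"].pop()
    return f"code v{2 - len(state['issues'])}"

print(refine("task", generate, analyze))  # code v2
```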

5. Empirical Insights and Quality Distributions

Comprehensive statistics highlight the initial vulnerability of LLM outputs (as generated by GPT-4o) and the impact of static analysis-driven refinement:

Category            | Initial (%) | After 10 Iterations (%) | Δ (improvement, pp)
--------------------|-------------|-------------------------|--------------------
Security            | >40         | 13.4                    | –26.6
Readability (C)     | >80         | 18.2                    | –61.8
Reliability (W)     | >50         | 12.8                    | –37.2
Maintainability (R) | ~15         | 2.2                     | –12.8
Errors (E)          | ~20         | 15.6                    | –4.4
  • Functional correctness (passing all tests) increases by approximately 10 percentage points across issue selection strategies (Table 3, (Blyth et al., 20 Aug 2025)).
  • The fraction of code snippets exhibiting at least one security flaw drops from >40% to 13.4%, and readability violations drop from >80% to 18.2%, demonstrating systematic uplift in multi-dimensional code quality.
  • Empirically, LLMs guided by this feedback can repair a majority of security, readability, and reliability flaws while also improving maintainability and test pass rates.

6. Use Cases and Integration with Feedback-Driven Security Patching

PythonSecurityEval supports reproducible zero-shot benchmarking, automated security patching evaluation, and static analysis research:

  • Integration with FDSP:

As the canonical zero-shot benchmark, PythonSecurityEval underpins Feedback-Driven Security Patching (FDSP), in which LLMs generate, analyze, and iteratively repair vulnerable code. FDSP, as described in (Alrashedy et al., 2023), outperforms self-feedback methods by up to a 17.6% relative reduction in vulnerability rate, measured directly on this dataset.

  • General Research Applications:
    • Drop-in static analysis benchmark for LLM outputs
    • Fine-grained ablations on refinement rounds, issue-selection policies, and prompt configurations
    • Automated regression testing for security-mitigating strategies
    • Extensible schema enabling integration with dynamic analysis datasets and sequential pattern mining approaches
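
As a drop-in analysis example, the schema in Section 3 supports per-iteration aggregation directly; the records below are synthetic stand-ins in that schema, used here only to illustrate the computation:

```python
from collections import defaultdict

# Synthetic records following the Section 3 schema (iteration, bandit_issues).
records = [
    {"iteration": 0, "bandit_issues": [{"test": "B602"}]},
    {"iteration": 0, "bandit_issues": []},
    {"iteration": 1, "bandit_issues": []},
    {"iteration": 1, "bandit_issues": []},
]

# Group a boolean "any security issue?" flag per refinement round.
by_iter = defaultdict(list)
for rec in records:
    by_iter[rec["iteration"]].append(bool(rec["bandit_issues"]))

# Vulnerability rate per iteration: fraction of flagged snippets.
for it in sorted(by_iter):
    flagged = by_iter[it]
    print(it, sum(flagged) / len(flagged))
```

Swapping the synthetic list for records loaded from python_security_eval.jsonl yields the iteration-wise curves reported in Section 5.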

PythonSecurityEval complements, but is distinct from, several adjacent benchmarks:

Whereas PythonSecurityEval focuses on LLM code generation and static vulnerability assessment, (Ryan et al., 14 Dec 2025) targets fine-grained annotation of real-world malicious code within Python packages, using a statement-level taxonomy (47 malicious indicators, 7 behavioral types).

QUT-DV25 extends the static code and prompt approach of PythonSecurityEval by capturing eBPF-based dynamic traces from 14,271 Python packages, facilitating detection of multi-phase malware and covert network activity.

PySecDB catalogues real-world security-related code commits, using graph neural architectures to identify security fix patterns and augmenting the space of security patches.

A plausible implication is that PythonSecurityEval fills a critical role in bridging LLM-centric code generation and repair with both static and dynamic threat detection workflows for Python.


References:

  • Blyth et al., 20 Aug 2025
  • Alrashedy et al., 2023
  • Ryan et al., 14 Dec 2025
  • Mehedi et al., 20 May 2025
  • Sun et al., 2023