PythonSecurityEval: Security Benchmark
- PythonSecurityEval is a benchmark suite that evaluates Python code security, robustness, and quality using real-world prompts and CWE-based vulnerability taxonomies.
- It integrates static analysis, dynamic test oracles, and iterative, LLM-driven repair methods to detect and reduce vulnerabilities effectively.
- Real-world analyses and empirical metrics from datasets like PyPI validate its multidimensional approach to enhancing code security and maintainability.
PythonSecurityEval is a comprehensive benchmark and methodology suite for rigorously evaluating the security, robustness, and overall quality of Python code, especially vulnerabilities introduced or left unfixed by automated code generation, package maintenance practices, and static/dynamic analysis tools. It has grown to span functional, security, and software engineering dimensions and is now central to empirical studies of Python code security in both academic and industry research.
1. Origins and Benchmark Definition
PythonSecurityEval originated as a large-scale, prompt-driven evaluation dataset targeting Python-specific security flaws generated by LLMs and other automated sources. The foundational dataset was released with 470 natural-language prompts, curated from real-world Stack Overflow tasks to reflect realistic and security-critical software development scenarios (Alrashedy et al., 2023). These prompts are domain-tagged based on associated library usage; because a prompt may carry several tags, the percentages sum to more than 100%: system (os, subprocess) 66.6%, computation (numpy, pandas) 35.7%, network (requests, flask) 31.3%, database (sqlite3, psycopg2) 24.3%, web frameworks (Flask, Django) 9.1%, and cryptography (hashlib, Crypto) 6.2%.
Each instance in PythonSecurityEval captures:
- The prompt (task description)
- Domains and libraries implicated
- Optional scenario notes
No canonical reference code is supplied; security evaluation is performed against the code output by code-generation models or sampled package snapshots, using standardized detection tools.
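For concreteness, a minimal sketch of what one instance record might look like; the field names below are illustrative assumptions, not the dataset's actual schema:

```python
# Illustrative record layout for one PythonSecurityEval instance.
# Field names are hypothetical; the released dataset defines its own schema.
instance = {
    "id": 42,
    "prompt": "Write a Flask endpoint that runs a shell command "
              "supplied by the user and returns its output.",
    "domains": ["system", "network"],      # multiple domain tags are possible
    "libraries": ["subprocess", "flask"],
    "notes": "Security-critical: user input reaches a shell.",
}
# No reference solution is stored; generated code is judged by running
# standardized detection tools (e.g. Bandit) over the model's output.
```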
This benchmark was subsequently extended to cover code quality aspects including security, reliability, readability, and maintainability, with dedicated static analysis instrumentation using Bandit and Pylint (Blyth et al., 20 Aug 2025).
2. Vulnerability Taxonomies and Coverage
Security evaluation across PythonSecurityEval leverages a taxonomy dominated by CWE-defined categories, with injection flaws (CWE-78 OS command injection, CWE-89 SQL injection) comprising ~40% of observed issues, and configuration/hard-coding problems (CWE-259, hard-coded password; CWE-400, resource consumption; CWE-20, input validation errors) constituting 20%. Typical coverage also includes cross-site scripting (CWE-79), path traversal (CWE-22), authentication/authorization failures (CWE-306, CWE-284), and exposure of sensitive data (CWE-200) (Alrashedy et al., 2023).
High-precision benchmarks such as PyVul further annotate over 1,000 real-world, developer-verified function/commit-level vulnerabilities with 151 CWEs across 28 meta-categories; the most prevalent are injection (17.5%), improper access control (11.5%), out-of-bounds read/write (9.9%), and input validation errors (6.5%) (Quan et al., 4 Sep 2025).
For cryptographic code, CIPHER supplies a domain-specialized vulnerability taxonomy comprising 58 fine-grained types (e.g., static IV, nonce reuse, missing authentication, certificate validation bypass), explicitly covering key management, randomness flaws, and protocol misconfigurations (Manolov et al., 1 Feb 2026).
3. Methodologies: Static and Dynamic Analysis, Multidimensional Metrics
PythonSecurityEval integrates both static and dynamic vulnerability assessment.
Static Analysis Feedback Loop
The static analysis feedback protocol iteratively refines code by alternating LLM generation with Bandit and Pylint checks. Issues are weighted by severity (HIGH = 30, MEDIUM = 20, LOW = 10 for security issues; all other classes = 3), and a snippet's fitness is defined as

$$\mathrm{fitness}(c) = \frac{1}{1 + W(c)},$$

where $W(c)$ sums the severity weights of all issues flagged in candidate $c$ (Blyth et al., 20 Aug 2025). Empirically, selecting 3–5 high-severity issues per iteration yields the best quality improvement, with successful convergence generally in under 10 rounds.
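A minimal sketch of this loop in Python, assuming Bandit's JSON output format and the inverse fitness form shown above (the paper's exact formula may differ); `llm_refine` is a hypothetical hook standing in for the actual generation model:

```python
import json
import os
import subprocess
import tempfile

# Severity weights from the protocol: security issues score 30/20/10,
# all other (quality) issue classes score 3.
WEIGHTS = {"HIGH": 30, "MEDIUM": 20, "LOW": 10}
DEFAULT_WEIGHT = 3

def bandit_issues(code: str) -> list[dict]:
    """Run Bandit on a snippet and return its JSON 'results' list."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
    try:
        out = subprocess.run(["bandit", "-f", "json", f.name],
                             capture_output=True, text=True)
        return json.loads(out.stdout).get("results", [])
    finally:
        os.unlink(f.name)

def fitness(code: str) -> float:
    """Assumed fitness form 1/(1 + W): 1.0 means no flagged issues."""
    w = sum(WEIGHTS.get(i["issue_severity"], DEFAULT_WEIGHT)
            for i in bandit_issues(code))
    return 1.0 / (1.0 + w)

def refine(code: str, llm_refine, max_rounds: int = 10) -> str:
    """Feed the 3-5 most severe issues back to the model each round."""
    for _ in range(max_rounds):          # convergence is typically < 10 rounds
        issues = sorted(bandit_issues(code),
                        key=lambda i: WEIGHTS.get(i["issue_severity"],
                                                  DEFAULT_WEIGHT),
                        reverse=True)[:5]
        if not issues:
            break                        # clean: fitness(code) == 1.0
        code = llm_refine(code, issues)  # hypothetical LLM call
    return code
```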
Outcome-Driven Test Oracles
Inspired by CWEval (Peng et al., 14 Jan 2025), dynamic oracles enforce both correctness (via I/O functional tests) and active security (via exploit payloads, integrity checks, and timeouts). Key metrics include:
- func@k: pass@k for functional correctness
- func-sec@k: pass@k for functional + security correctness
- Precision/Recall/F1: for vulnerability detection based on oracle outcomes
Security oracles test resilience to injection, resource exhaustion, serialization exploits, and path traversal under actively adversarial input—extending beyond the coverage of static pattern matching.
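These @k metrics can be computed with the standard unbiased pass@k estimator; that CWEval uses exactly this estimator is an assumption here, but the sketch shows the mechanics:

```python
# Unbiased pass@k estimator (Chen et al., 2021), reused for func@k and
# func-sec@k: c counts samples that pass the functional oracle alone, or
# the functional + security oracles together, out of n generations.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n (c passing) passes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 generations, 4 pass I/O tests, only 2 also survive exploit payloads:
func_at_1 = pass_at_k(10, 4, 1)      # 0.4
func_sec_at_1 = pass_at_k(10, 2, 1)  # 0.2
```

By construction func-sec@k ≤ func@k, since the security oracle only removes otherwise-passing samples.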
Empirical Metrics
Metrics computed routinely include the following (see the sketch after this list):
- VulnerabilityRate: fraction of cases with ≥1 vulnerability (per static analyzer or oracle)
- FixRate: fraction of vulnerabilities corrected by a given patching methodology
- Issue-Type Distribution: frequency per CWE/code quality class
- Severity-weighted aggregates
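A minimal sketch of these aggregates, assuming each case record carries its flagged issues (the record schema is illustrative):

```python
# Illustrative computation of the aggregate metrics above from per-case
# results; each case is assumed to be {"issues": [{"cwe": ...}, ...]}.
from collections import Counter

def vulnerability_rate(cases: list[dict]) -> float:
    """Fraction of cases with at least one flagged vulnerability."""
    return sum(bool(c["issues"]) for c in cases) / len(cases)

def fix_rate(before: list[dict], after: list[dict]) -> float:
    """Fraction of originally flagged issues no longer present after patching."""
    n_before = sum(len(c["issues"]) for c in before)
    n_after = sum(len(c["issues"]) for c in after)
    return (n_before - n_after) / n_before if n_before else 1.0

def issue_type_distribution(cases: list[dict]) -> Counter:
    """Frequency of each CWE / quality class across all flagged issues."""
    return Counter(i["cwe"] for c in cases for i in c["issues"])
```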
For cryptography, line-level evidence spans and per-vulnerability precision/recall/F1 are measured via an LLM judge with bootstrap CIs; CIPHER’s scoring achieves 87% precision and 85% recall (Manolov et al., 1 Feb 2026).
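Such bootstrap CIs can be obtained by resampling per-vulnerability judgments; a minimal percentile-bootstrap sketch (the resample count and interval method are assumptions, not CIPHER's exact procedure):

```python
# Percentile bootstrap CI for a per-item detection metric such as recall;
# judgments is a list of 1 (correct) / 0 (incorrect) LLM-judge verdicts.
import random

def bootstrap_ci(judgments: list[int], n_resamples: int = 10_000,
                 alpha: float = 0.05) -> tuple[float, float]:
    stats = sorted(
        sum(random.choices(judgments, k=len(judgments))) / len(judgments)
        for _ in range(n_resamples)
    )
    lo = stats[int(n_resamples * alpha / 2)]
    hi = stats[int(n_resamples * (1 - alpha / 2))]
    return lo, hi
```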
4. Automated Repair and Feedback-Driven Security Patching
PythonSecurityEval evaluates and enables automatic repair via external feedback. The FDSP (Feedback-Driven Security Patching) workflow uses Bandit reports as direct feedback to the refiner LLM (Alrashedy et al., 2023). After initial generation, the model is provided with both code and Bandit feedback, prompted to suggest multiple candidate fixes, and iteratively applies these until vulnerability-free code is found or a maximum patch depth is reached.
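A sketch of this workflow under stated assumptions: `generate` and `propose_fixes` are hypothetical LLM wrappers, `bandit_issues` is the helper sketched in Section 3, and the candidate-selection rule is an illustrative choice rather than FDSP's published one:

```python
# Sketch of the FDSP loop: Bandit output is fed back verbatim, the model
# proposes several candidate patches, and the cleanest candidate is kept.
def fdsp(prompt: str, generate, propose_fixes, max_depth: int = 5) -> str:
    code = generate(prompt)
    for _ in range(max_depth):
        report = bandit_issues(code)               # assumed available (Section 3)
        if not report:
            return code                            # vulnerability-free: done
        candidates = propose_fixes(code, report)   # several fixes per round
        if not candidates:
            break
        # keep the candidate with the fewest remaining flagged issues
        code = min(candidates, key=lambda c: len(bandit_issues(c)))
    return code
```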
Empirical results with GPT-4 show:
- Raw: 40.21% VulnerabilityRate
- Direct Prompting/Self-Debugging: ≈25%
- FDSP: 7.4% (an additional 17.6 percentage point reduction over prior state-of-the-art) (Alrashedy et al., 2023)
Static analysis-driven iterative prompting yields comparable relative reductions, with security issue rates falling from 42.7% to 13.8% after ten feedback iterations (Blyth et al., 20 Aug 2025).
5. Real-World Vulnerability Datasets, ML Architectures, and Quality of Detection
Security evaluation and tool benchmarking on Python code increasingly depend on rigorous, high-quality datasets and advanced graph/learning models:
- PySecDB (Sun et al., 2023): 1,258 manually validated security commits, using CommitCPG graph representations (statement-nodes with contextual code, function, version) and a SCOPY multi-attributed GNN to identify security-related changes. Four fix patterns (sanity checks, API usage updates, regex changes, security property restrictions) cover 85.85% of observed repairs.
- PyVul (Quan et al., 4 Sep 2025): 1,157 clean, function- and commit-level, developer-verified vulnerabilities across 349 packages (over 67% Python-only, the remainder spanning multiple languages), labeled and validated through LLM-assisted expert review, achieving 100% commit-level and 94.2% function-level label accuracy.
Extensive evaluation of static analyzers (CodeQL, PySA, Bandit) and LLMs (CodeQwen, GPT-3.5/4) against PyVul shows sensitivity bottlenecks: CodeQL detects only 10.8% of real-world flaws; LLMs achieve modest performance (best F₁ ~75% for fine-tuned CodeQwen1.5). Major obstacles are limited taint modeling, function-only context, and high variation within CWE classes.
6. Broader Assessment: Large-Scale PyPI Analysis and Ecosystem Risks
Large-scale analysis of the PyPI registry highlights ecosystem-wide risks relevant to any PythonSecurityEval framework. In a 2020 crawl of nearly 200,000 packages, static analysis (Bandit) found that 46% of packages contain at least one issue, with generic risky patterns (catch-all exception handlers), injection vulnerabilities (notably via subprocess), and insecure function/import use dominating the frequency landscape (Ruohonen et al., 2021). Issue density correlates only weakly with package code size; small packages are nearly as likely to exhibit flaws as large ones.
Ecosystem threat modeling based on dependency graphs, maintainer “reach,” and dependency policies reveals a high attack surface due to the prevalence of typosquatting, malicious dependency graphs, install-time code execution (setup.py), and lack of enforced code-signing (Bagmar et al., 2021). Composite threat scores combining downstream reach, implicit trust, and presence of dangerous installation behavior yield actionable audit and gating policies.
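A toy illustration of such a composite score; the factors mirror the ingredients named above, but the weights, normalization, and linear combination are assumptions made here for illustration, not the paper's formula:

```python
# Hypothetical composite threat score over normalized risk factors in [0, 1].
def threat_score(downstream_reach: float, implicit_trust: float,
                 has_install_hooks: bool,
                 w: tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Weighted combination: dependents reached, trust placed in maintainers,
    and presence of install-time code execution (e.g. setup.py hooks)."""
    return (w[0] * downstream_reach
            + w[1] * implicit_trust
            + w[2] * float(has_install_hooks))

# Gating example: flag packages above a chosen audit threshold.
needs_audit = threat_score(0.9, 0.7, True) > 0.6   # True (score = 0.86)
```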
7. Recommendations and Future Directions
- Integrate both static (Bandit, CodeQL), dynamic (oracle-driven), and feedback-based analysis for holistic vulnerability detection.
- Maintain continuously updated datasets and functionally annotated, developer-verified corpora for benchmarking model and analyst performance.
- Tailor metric reporting to the application: vulnerability rate, func@k/func-sec@k, precision/recall, and per-CWE risk distribution.
- Automate repair via LLMs with detailed external feedback, but validate fixes using iterative and multi-dimensional static analysis.
- Augment code property graphs with language- and CWE-specific enhancements, support cross-language flow analysis, and extend to dynamic vulnerability classes (fuzzing, integration tests).
- Leverage ecosystem-level metrics (dependency risk, maintainer centrality) for supply-chain security and prioritize high-risk strata for intensive evaluation.
By institutionalizing these approaches, PythonSecurityEval sets a rigorous, multidimensional foundation for the measurement, mitigation, and remediation of vulnerabilities in modern Python software (Alrashedy et al., 2023, Peng et al., 14 Jan 2025, Blyth et al., 20 Aug 2025, Quan et al., 4 Sep 2025, Sun et al., 2023, Bagmar et al., 2021, Ruohonen et al., 2021, Manolov et al., 1 Feb 2026).