Exploit Success Rate (ESR) Metrics
- Exploit Success Rate (ESR) is a quantitative metric that measures the probability of successful vulnerability exploitation under defined conditions across various domains.
- ESR evaluations incorporate empirical data, automated exploit generation, and physical-layer decoding to highlight differences between reported and validated success rates.
- Robust statistical models and multi-stage verification processes are key to refining ESR metrics, guiding risk prioritization and improving overall security assessments.
Exploit Success Rate (ESR) is a quantitative metric for the likelihood or frequency with which a vulnerability exploitation attempt achieves its intended effect, whether that effect is real-world abuse, automated exploit generation, or adversarial decoding in information-theoretic settings. ESR is operationalized differently across domains (empirical cybercrime studies, automated security tooling, and physical-layer coding theory), but in each case it concerns the probability or observed proportion of exploitation success under specified conditions. It serves both as a performance indicator for automated systems and as a risk metric for prioritizing vulnerabilities.
1. Formal Definitions and Mathematical Frameworks
The definition of ESR varies by research domain, but each formulation encodes a probability or observed fraction of successful exploitation outcomes.
- Empirical Security Economics (Allodi, 2017): ESR is defined as the probability that a vulnerability with specific economic and technical characteristics is seen exploited in the wild. For traded vulnerabilities, a binary outcome variable denotes confirmed exploitation:
$$\mathrm{ESR}_i = \Pr(y_i = 1 \mid x_i)$$
where $y_i = 1$ if threat intelligence (e.g., Symantec Attack Signatures) records in-the-wild exploitation for vulnerability $i$, and $x_i$ denotes vulnerability covariates.
- Automated Exploit Evaluation (Yang et al., 1 Apr 2026, Sajadi et al., 15 Feb 2026, Gezgin et al., 4 Feb 2026): ESR, also called Attack Success Rate (ASR) or simply Success Rate (SR), is formalized as the ratio of successful exploitation tasks to total attempted tasks:
$$\mathrm{ESR} = \frac{N_{\mathrm{success}}}{N_{\mathrm{total}}}$$
with $N_{\mathrm{success}}$ the number of vulnerabilities, objectives, or PoCs confirmed as successful under automated or manual validation, and $N_{\mathrm{total}}$ the number attempted.
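- Physical-Layer Security (Médard et al., 2022): The exploit success rate, termed the success exponent $E$, provides a decay rate for block-level correct decodings by an eavesdropper:
$$E = -\lim_{n \to \infty} \frac{1}{n} \log_2 \Pr(\text{correct decoding at blocklength } n)$$
so at finite blocklength $n$, $\Pr(\text{correct decoding}) \approx 2^{-nE}$.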
Across these formulations, ESR quantifies the effectiveness of attack generation, risk of real exploitation, or the confidence with which a decoding or attack attempt will succeed.
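The ratio form of ESR can be sketched in a few lines of Python. The Wilson score interval below is not part of any cited definition; it is added here as an assumption about how one might express the uncertainty of a rate estimated from a finite number of attempts. The function names and the counts are hypothetical.

```python
import math


def esr(successes: int, attempts: int) -> float:
    """Observed ESR: fraction of attempted tasks confirmed successful."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must lie in [0, attempts]")
    return successes / attempts


def wilson_interval(successes: int, attempts: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    p = esr(successes, attempts)
    denom = 1 + z**2 / attempts
    centre = (p + z**2 / (2 * attempts)) / denom
    half = z * math.sqrt(p * (1 - p) / attempts + z**2 / (4 * attempts**2)) / denom
    return centre - half, centre + half


# Hypothetical counts for illustration only (not taken from any cited study).
rate = esr(30, 100)
low, high = wilson_interval(30, 100)
print(f"ESR = {rate:.2%}, 95% Wilson interval = [{low:.2%}, {high:.2%}]")
```

Reporting an interval alongside the point estimate makes it easier to judge whether two systems evaluated on small benchmarks actually differ.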
2. Empirical Estimation: Models and Validation Criteria
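- Cybercrime Market Studies (Allodi, 2017): ESR is modeled using mixed-effects logistic regression to capture the influence of market and technical factors on real-world exploit appearance. The key statistical model is:
$$\operatorname{logit} \Pr(y_i = 1 \mid x_i) = \beta_0 + u_{v(i)} + \beta_1 \log(\mathrm{PackActivity}_i) + \beta_2\,\mathrm{Price}_i + \beta_3\,\mathrm{Critical}_i, \qquad u_{v(i)} \sim \mathcal{N}(0, \sigma_u^2)$$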
This decomposes exploit risk into vendor heterogeneity, market buzz (number of advert replies), economic cost (USD price), and binary technical severity (critical/non-critical), with the random effect absorbing unmeasured vendor reputation/quality.
- Automated Exploit Generation Platforms:
  - AXE (Sajadi et al., 15 Feb 2026): ESR is computed as the proportion of CVEs (from the CVE-Bench dataset) for which at least one exploit attempt is confirmed by an external test oracle, aggregated under fixed interaction budgets (Success@1, Success@5).
  - AutoEG (Yang et al., 1 Apr 2026): ESR is the fraction of attack objectives (across vulnerabilities and models) for which the system achieves a measurable impact, validated through black-box interaction with live deployments.
  - PoC-Gym (Gezgin et al., 4 Feb 2026): ESR is reported at three levels (automated pipeline check, post-hoc trace validation, and manual inspection), each measuring the fraction of generated proof-of-concept exploits that demonstrate the intended vulnerability under progressively stricter validation.
- Physical-Layer Attacks (Médard et al., 2022): The empirical ESR is measured via simulation of coding attacks (e.g., GRAND decoding for random linear codes) and compared to theoretical predictions of the success exponent.
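The budgeted Success@k aggregation described for AXE above can be illustrated with a minimal sketch. It assumes each target has an ordered log of oracle verdicts, one per attempt; the data layout, function name, and target identifiers are hypothetical and not taken from the cited work.

```python
from typing import Mapping, Sequence


def success_at_k(outcomes: Mapping[str, Sequence[bool]], k: int) -> float:
    """Fraction of targets with at least one confirmed success in the first k attempts."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not outcomes:
        raise ValueError("no targets evaluated")
    solved = sum(1 for attempts in outcomes.values() if any(attempts[:k]))
    return solved / len(outcomes)


# Hypothetical per-target attempt logs; True means the oracle confirmed the attempt.
runs = {
    "target-A": [False, False, True, False, False],
    "target-B": [True],
    "target-C": [False, False, False, False, False],
    "target-D": [False, True],
}
print(success_at_k(runs, 1))  # only target-B succeeds on the first attempt
print(success_at_k(runs, 5))  # target-A, target-B, and target-D succeed within five
```

Because the denominator is the number of targets rather than the number of attempts, Success@k can only stay flat or increase as k grows.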
3. Key Results and Comparative Performance
Reported ESR values vary widely across methodologies and system configurations:
| System / Study | Reported ESR (%) | Definition/Scope | Key Reference |
|---|---|---|---|
| AXE (grey-box, multi-agent) | 30 | CVE-Bench Success@5 | (Sajadi et al., 15 Feb 2026) |
| AXE (black-box, baselines) | 10 | CVE-Bench Success@5 | (Sajadi et al., 15 Feb 2026) |
| AutoEG (end-to-end pipeline) | 82.41 | 660 tasks, 104 CVEs | (Yang et al., 1 Apr 2026) |
| Best AutoEG baseline (VulnBot) | 32.88 | Same | (Yang et al., 1 Apr 2026) |
| PoC-Gym, pipeline automated pass | ~75 | Java PoC tasks | (Gezgin et al., 4 Feb 2026) |
| PoC-Gym, manual valid | ~29 | Manual inspection | (Gezgin et al., 4 Feb 2026) |
| Allodi (market-based, regression) | Odds ratios only | In-the-wild exploit | (Allodi, 2017) |
Notably, Gezgin et al. (4 Feb 2026) find a large gap between pipeline-reported ESR (~75%) and the rate of valid exploits after manual inspection (~29%), indicating substantial overestimation by automated validation.
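The gap between pipeline-level and manually validated ESR can be made concrete with a small sketch. It assumes the stages are nested, so a PoC counts at a later stage only if it also passed the earlier ones; that nesting, the record fields, and the sample data are all hypothetical. With the approximate figures reported above, 75 / 29 gives an overestimation factor of roughly 2.6.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class PocResult:
    pipeline_pass: bool  # automated pipeline check passed
    trace_valid: bool    # post-hoc trace validation passed
    manual_valid: bool   # human reviewer confirmed the exploit


def staged_esr(results: Sequence[PocResult]) -> Dict[str, float]:
    """ESR at each validation stage, all relative to the same set of attempts."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to aggregate")
    pipeline = sum(r.pipeline_pass for r in results)
    trace = sum(r.pipeline_pass and r.trace_valid for r in results)
    manual = sum(r.pipeline_pass and r.trace_valid and r.manual_valid for r in results)
    return {"pipeline": pipeline / n, "trace": trace / n, "manual": manual / n}


# Hypothetical sample: 8 generated PoCs with made-up verdicts.
sample = (
    [PocResult(True, True, True)] * 2
    + [PocResult(True, True, False)] * 2
    + [PocResult(True, False, False)] * 2
    + [PocResult(False, False, False)] * 2
)
rates = staged_esr(sample)
overestimation = rates["pipeline"] / rates["manual"]
print(rates, overestimation)
```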
In physical-layer security, the success exponent $E$ predicts the exponential decay in Eve's probability of correct decoding for code rates above channel capacity, validated empirically for moderate blocklengths (Médard et al., 2022).
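As a rough numerical illustration of the exponential decay, the sketch below converts between a success exponent and a success probability at blocklength n. It assumes base-2 logarithms (the base is a convention, not something confirmed by the source), and the exponent value is hypothetical.

```python
import math


def predicted_success(n: int, exponent: float) -> float:
    """Approximate probability of correct decoding at blocklength n: 2^(-n * E)."""
    return 2.0 ** (-n * exponent)


def empirical_exponent(n: int, p_success: float) -> float:
    """Finite-blocklength estimate of the exponent from an observed success rate."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 < p_success <= 1:
        raise ValueError("p_success must lie in (0, 1]")
    return -math.log2(p_success) / n


E = 0.1  # hypothetical exponent, for illustration only
for n in (16, 32, 64):
    p = predicted_success(n, E)
    print(n, p, empirical_exponent(n, p))
```

In a simulation study, `p_success` would instead be the observed fraction of blocks decoded correctly, and the estimate would be compared against the theoretical exponent as n grows.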
4. Statistical Factors and Error Analyses
- Market-Based Predictors: Exploit market “buzz” positively correlates with ESR, with an odds ratio near 3 per log unit increase in PackActivity (Allodi, 2017). High exploit bundle prices suppress wide deployment (OR ≈ 0.36), while high-severity CVEs (CVSS ≥ 9) are >10× more likely to be exploited.
- Automated Exploit Systems: Error taxonomies in AXE (Sajadi et al., 15 Feb 2026) and PoC-Gym (Gezgin et al., 4 Feb 2026) indicate that semantic misinterpretation, incomplete preconditions, incorrect attack-surface targeting, and poor payload construction are dominant failure modes. AXE attributes most failures (60%) to high-level strategic (planner) errors rather than low-level interaction bugs.
- Validation Metrics: Multi-stage validation in PoC-Gym distinguishes between mere "pipeline" success (syntactic/run-based checks), trace-based validation (actual data/control flow), and human review. Surface-level ESR metrics may systematically overestimate true exploitability due to LLM shortcutting or gaming output constraints (Gezgin et al., 4 Feb 2026).
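Odds ratios such as those quoted above act multiplicatively on odds, not on probabilities, so their effect depends on the baseline. A minimal sketch of the conversion follows; the 10% baseline is a hypothetical value chosen for illustration, not a figure from the cited study.

```python
def apply_odds_ratio(baseline_prob: float, odds_ratio: float) -> float:
    """Shift a baseline probability by an odds ratio and return the new probability."""
    if not 0 < baseline_prob < 1:
        raise ValueError("baseline_prob must lie strictly between 0 and 1")
    if odds_ratio <= 0:
        raise ValueError("odds_ratio must be positive")
    odds = baseline_prob / (1 - baseline_prob)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)


baseline = 0.10  # hypothetical baseline exploitation probability
print(apply_odds_ratio(baseline, 3.0))   # odds ratio near 3 (market activity)
print(apply_odds_ratio(baseline, 0.36))  # odds ratio near 0.36 (high price)
```

With this baseline, an odds ratio of 3 raises the probability to 25% rather than 30%, which is why odds ratios should not be read as direct probability multipliers.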
5. Design Principles and Recommendations for Maximizing ESR
- Pipeline Modularity and Abstraction (Yang et al., 1 Apr 2026): Decoupling vulnerability semantics from exploit code instantiation via reusable trigger functions facilitates high, transferable ESR across vulnerabilities.
- Early Test-Driven Validation: Unit-testing exploit triggers before runtime attempts helps eliminate downstream errors from subtle semantic or syntactic mistakes.
- Feedback-Driven Refinement: Iterative, black-box feedback loops allow systems to adapt exploits to variable runtime behavior, outperforming one-shot code generation (Yang et al., 1 Apr 2026).
- Multi-Agent Decomposition: Restricting LLM agents to narrow, focused roles reduces hallucination, supports auditing, and enables partial human oversight (Sajadi et al., 15 Feb 2026, Yang et al., 1 Apr 2026).
- Stringent Validation and Ground Truth (Gezgin et al., 4 Feb 2026): To close the gap between reported and real ESR, dynamic validation must check for full exploitation path coverage, not just minimal programmatic execution markers or API usage.
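The feedback-driven refinement principle above can be sketched as a generic propose, validate, revise loop under a fixed attempt budget. This is an abstract illustration, not the pipeline of any cited system: the function names are hypothetical, and the toy validator and reviser operate on integers purely to make the control flow runnable.

```python
from typing import Callable, Optional, Tuple, TypeVar

C = TypeVar("C")


def refine_until_valid(
    initial: C,
    validate: Callable[[C], Tuple[bool, str]],
    revise: Callable[[C, str], C],
    budget: int,
) -> Tuple[Optional[C], int]:
    """Return (accepted candidate or None, number of validation attempts used)."""
    candidate = initial
    for attempt in range(1, budget + 1):
        ok, feedback = validate(candidate)
        if ok:
            return candidate, attempt
        # Use the validator's feedback to produce the next candidate.
        candidate = revise(candidate, feedback)
    return None, budget


# Toy stand-ins: the "oracle" accepts the value 7 and reports a direction otherwise.
def toy_validate(x: int) -> Tuple[bool, str]:
    if x == 7:
        return True, "ok"
    return False, ("too low" if x < 7 else "too high")


def toy_revise(x: int, feedback: str) -> int:
    return x + 1 if feedback == "too low" else x - 1


print(refine_until_valid(4, toy_validate, toy_revise, budget=5))
print(refine_until_valid(4, toy_validate, toy_revise, budget=3))
```

Recording the number of attempts used per task is what makes budgeted metrics such as Success@1 and Success@5 computable from the same run logs.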
6. Domain-Specific Perspectives and Impact
- Empirical Risk Prioritization: Allodi (2017) demonstrates that market-derived signals (trading activity, pricing) are as predictive of exploitation at scale as technical severity, which suggests that static risk scores (e.g., CVSS) should incorporate economic indicators.
- Automated Vulnerability Triage: AXE and AutoEG show that ESR serves as an actionable performance measure for automated triage and remediation systems, particularly in large-scale deployments (Sajadi et al., 15 Feb 2026, Yang et al., 1 Apr 2026).
- Code-Based Channel Security: In wireless secrecy, ESR via the success exponent bridges theoretical coding results and practical eavesdropping threats, providing concrete blocklength security guarantees (Médard et al., 2022).
7. Limitations, Misconceptions, and Future Directions
A prevalent misconception is that high pipeline ESR equates to practical exploitability; in reality, incomplete validation (e.g., execution markers, surface-level sink checks) enables LLMs or attackers to game ESR metrics without traversing real exploit paths (Gezgin et al., 4 Feb 2026). Empirical research consistently finds that true, manually verified ESR remains substantially lower than naive counts suggest.
Enhanced dynamic validation, curated ground-truth datasets, and human-in-the-loop checks are recommended to align ESR with true exploit feasibility. Future work on both empirical and automated systems should further decompose error provenance and design ESR metrics that are resilient to evasion and shortcutting across varying exploitation contexts.