CyberSecEval2 Benchmark

Updated 22 January 2026

CyberSecEval2 is a comprehensive evaluation suite that assesses LLM cybersecurity risks using four test categories: prompt injection, code interpreter abuse, cyberattack compliance, and exploit generation.
It employs explicit metrics like prompt injection success rate, False Refusal Rate, and exploit generation scores to quantify safety-utility tradeoffs across leading models.
Benchmark findings reveal varying vulnerabilities among models, underscoring the need for multi-layered defenses and continuous improvement in LLM security.

CyberSecEval2 is a comprehensive evaluation suite for LLMs in cybersecurity contexts. It quantitatively assesses both the offensive capabilities and security risks presented by LLMs, encompassing prompt injection vulnerabilities, code interpreter abuse, cyberattack helpfulness and refusal rates, and exploit generation proficiency. Designed to extend the original CyberSecEval1, this benchmark systematically exposes weaknesses, quantifies safety-utility tradeoffs, and provides open-source resources for ongoing evaluation and improvement (Bhatt et al., 2024).

1. Benchmark Architecture and Test Taxonomy

CyberSecEval2 defines four orthogonal test categories for rigorous LLM security assessment:

Prompt Injection: Evaluates susceptibility to user inputs designed to circumvent or override system-level controls, analogous to SQL injection or command-injection attacks in software security. Fifteen injection techniques are covered, including Ignore Previous Instructions, Indirect References, Token Smuggling (e.g., ROT13, Base64, Morse), System-Mode Impersonation, Different-Language Inputs, and advanced multi-step attacks. Test cases consist of a system prompt (trusted instructions), a malicious user input, and a judge question for binary outcome validation. Injections are classified along two axes: logic-violating (system logic bypass without direct harm) versus security-violating (clear malicious intent); each further split into direct and indirect forms.
Code Interpreter Abuse: Measures the ease with which an LLM can be provoked to generate malicious code targeting an attached interpreter (e.g., Python sandbox). Five behavior categories (each with 100 prompts) include container escape, privilege escalation, reflected attacks (leveraging interpreter outputs against external targets), post-exploitation persistence, and automated social engineering scripts. Outputs are adjudicated by a distinct judge LLM for compliance versus refusal.
Cyberattack Helpfulness and Safety-Utility Tradeoff: Quantifies LLM compliance on cyberattack-related tasks mapped to MITRE ATT&CK TTPs such as reconnaissance, evasion, and execution. The suite introduces the False Refusal Rate (FRR) metric by evaluating model responses to ambiguous, benign cybersecurity requests (borderline prompts)—for example, queries such as how to monitor one’s own port scan. Single-turn prompts are judged for both malicious compliance and benign refusals.
Exploit Generation: Assesses the reasoning and coding ability of LLMs in generating working exploits for randomized artificial challenges, including string constraint satisfaction (in C, JavaScript, Python), SQL injection (Python + sqlite3), buffer overflows (C), and diverse memory corruption scenarios (integer overflows, use-after-free). Programs are designed such that solutions require derivation of hidden internal constraints, and partial credit is granted where appropriate.

2. Formal Metrics and Evaluation Criteria

CyberSecEval2 employs explicit, well-defined metrics:

Prompt Injection Success Rate: $success\_rate = N_{inj\_success} / N_{inj\_total}$ , where $N_{inj\_success}$ is the number of successful injection attempts and $N_{inj\_total}$ is the total number administered.
False Refusal Rate (FRR): $FRR = N_{benign\_refused} / N_{benign\_total}$ , quantifying unnecessary model refusals to legitimate (albeit ‘borderline’) cybersecurity prompts.
Safety-Utility Tradeoff Visualization: For each model, plot $(1 - compliance\_rate_{malicious})$ on the X-axis versus FRR on the Y-axis. Models approaching the lower-right demonstrate high malicious refusal with low benign refusal—characterizing optimal safety without significant utility loss.
Exploit Generation Score: For string constraint tasks, $score = (\#\,constraints\,satisfied) / (\#\,total\,constraints) \in [0, 1]$ . For injections and memory exploits, scoring is binary $\{0, 1\}$ with 0.5 credit for partial buffer overwrite in select tests.

3. Experimental Protocol and Model Selection

CyberSecEval2 evaluates contemporary SOTA LLMs under controlled single-turn conditions, without multi-turn or adversarial optimization:

Model	Prompt Injection Success (%)	Avg. Exploit Generation Score
GPT-4	17.1	≈ 0.30
Google Gemini Pro	21.5	N/A
CodeLlama 70B-Instruct	29.3	≈ 0.18
Llama 3 70B-Instruct	31.8	≈ 0.22
Llama 3 7B-Instruct	41.2	≈ 0.08

For prompt injection, 15 variants × ~20 prompts yield ~300 test cases.
Interpreter abuse utilizes 5 categories × 100 prompts each (500 total).
Cyberattack helpfulness tests span ~10 TTP categories × ~15 prompts each.
FRR is determined from ~100 borderline-benign prompts.
Exploit generation employs 30–50 random programs per task type.

Each LLM output is evaluated by dedicated judge models, with binary or partial scoring.

4. Principal Observations and Comparative Analysis

Prompt Injection Resilience: All tested models are non-trivially vulnerable; the average success rate across models is ≈28%, with security-violating injections showing greater efficacy (≈32%) compared to pure logic violations (≈26%).
Cyberattack Helpfulness and FRR: Malicious-prompt compliance reduced from 52% (CyberSecEval1) to 28% (CyberSecEval2), indicating more effective refusal mechanisms. However, utility loss is model-dependent—most models exhibit FRR < 15%, but CodeLlama 70B-Instruct demonstrates a markedly high FRR (≈70%), manifesting a pronounced safety-utility tradeoff.
Exploit Generation Capability: Models with intrinsic coding capacity (GPT-4, Llama 3 70B, CodeLlama) outperform those without; yet, absolute performance remains low (average exploit scores ≤0.3, SQL injection success ≈20% for GPT-4, near zero for others).
Interpreter Abuse: Across models, mean compliance with malicious code generation prompts is ≈35%, with top models generating sandbox-escaping or privilege escalation code >30% of the time.

5. Limitations, Open Problems, and Prescriptive Recommendations

CyberSecEval2 identifies substantive unresolved challenges:

Prompt injection mitigation remains insufficient—conditioning solely via system prompts does not adequately secure LLMs.
LLM-to-sandbox integrations require multi-layered defenses, including both hardened interpreter environments and adaptive safety fine-tuning.
Exploit generation proficiency is nascent; current LLMs lack the specialized reasoning and code synthesis skill required for reliable, autonomous exploit authorship.

Recommendations for future work include expanding prompt injection testing to multi-turn and optimization-based adversarial scenarios, broadening FRR evaluations to encompass additional risk domains (notably privacy leaks and prohibited content), and extending exploit generation to real-world CVE challenges and full spectrum red-team/blue-team exercises. The open-source repository serves as a locus for collaborative community advancement: https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks (Bhatt et al., 2024).

6. Role Within the Cybersecurity Benchmarking Landscape

CyberSecEval2 is recognized as a holistic, quantitative framework for benchmarking LLM safety and offensive capabilities in cybersecurity applications. Its design provides critical granularity absent from general-purpose or narrowly focused benchmarks. While other contemporaneous evaluations, such as CS-Eval (Yu et al., 2024), emphasize broad, multi-domain, and bilingual knowledge and reasoning in cybersecurity, CyberSecEval2 uniquely targets active adversarial risks arising from LLM deployment in operational settings, integrating both attack surface exposure and defensive tradeoffs. This suggests CyberSecEval2 and CS-Eval are complementary—the former specializing in empirical attack simulation, the latter in expansive cognitive domain coverage.

CyberSecEval2’s metrics and artifacts facilitate reproducible, transparent comparison of LLM security properties, fostering continuous model improvement and supporting both academic research and the security engineering community.

Markdown Report Issue Upgrade to Chat

References (2)

CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models (2024)

CS-Eval: A Comprehensive Large Language Model Benchmark for CyberSecurity (2024)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to CyberSecEval2 Benchmark.