
RedCode Benchmark: Evaluating Code Agent Safety

Updated 27 January 2026
  • RedCode Benchmark is a comprehensive suite for evaluating code agents' capacity to detect, generate, and execute potentially harmful code.
  • It features over 4,000 execution scenarios and 160 software generation challenges derived from real-world security vulnerabilities.
  • The benchmark uses Dockerized environments to measure agent safety through metrics like refusal, execution failure, and attack success rates.

RedCode is a benchmark suite for evaluating whether code agents recognize, generate, and execute risky or harmful code. It is motivated by the growing adoption of LLM-powered agents that generate code and use tools, and it addresses a gap in safety evaluation: existing benchmarks either assess only static code vulnerabilities or rely on simulated environments, whereas RedCode measures agent behavior in situ by executing code inside isolated sandboxes. Covering both risky code execution and risky code generation, RedCode comprises 4,050 execution-oriented scenarios and 160 software-generation challenges, each derived from real-world security vulnerabilities and practical attack patterns. For extensibility and reproducibility, RedCode provides public datasets, code, and Dockerized environments for standardized safety assessment of code agents (Guo et al., 2024).

1. Motivation and Conceptual Foundations

The emergence of code-focused agents (e.g., OpenCodeInterpreter, CodeAct, ReAct) has enabled workflows that can directly interact with external systems via code generation and execution. These capabilities simultaneously introduce severe safety concerns in practical deployment. Prior safety benchmarks have key limitations:

  • Static analysis failing to account for run-time effects.
  • Artificial "tool-use" traces or reliance on LLMs-as-judges.
  • Insufficient coverage of actual agent behavior in realistic sandboxed environments.

RedCode is constructed on four guiding principles:

  1. Real System Interaction: Each test executes in a dedicated Docker container to enable unobstructed, realistic agent–system interaction.
  2. Holistic Threat Modeling: Captures both risky code execution (RedCode-Exec) and risky code generation (RedCode-Gen).
  3. Diversity of Inputs: Includes both code snippets and natural-language descriptions, with tasks expressed in Python and Bash.
  4. Breadth and Depth: Encompasses 25 critical vulnerability seeds across 8 domains and 30 variants per seed for robust scenario coverage.

2. RedCode-Exec: Risky Code Execution Benchmark

RedCode-Exec consists of 4,050 test cases stratified by programming language, input format, and vulnerability type.

Test Case Composition:

  • Python:
    • 750 code snippets (25 scenarios × 30 variants)
    • 750 brief summaries
    • 750 detailed natural-language descriptions
  • Bash (excluding 5 Python-specific scenarios that have no Bash counterpart):
    • 600 code snippets (20 scenarios × 30)
    • 600 brief summaries
    • 600 detailed descriptions
  • Total: 4,050 execution prompts
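
The totals above follow from simple multiplication. The short check below assumes, as the breakdown suggests, that every scenario contributes 30 variants in each of the three input formats; the names used are illustrative only.

```python
# Sanity check of the RedCode-Exec test-case counts stated above.
VARIANTS_PER_SCENARIO = 30
INPUT_FORMATS = ("code snippet", "brief summary", "detailed description")

# Scenario counts per language, as listed in the breakdown (Bash omits 5).
SCENARIOS = {"Python": 25, "Bash": 20}


def count_cases(num_scenarios: int) -> int:
    """Cases for one language: scenarios x variants x input formats."""
    return num_scenarios * VARIANTS_PER_SCENARIO * len(INPUT_FORMATS)


per_language = {lang: count_cases(n) for lang, n in SCENARIOS.items()}
total = sum(per_language.values())

print(per_language)  # {'Python': 2250, 'Bash': 1800}
print(total)         # 4050
```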

Vulnerability Domains (25 scenarios over 8 domains):

  1. Website (e.g., posting sensitive files, untrusted installation, malicious crawling)
  2. Network (e.g., amplification DDoS, IP regex validation bug)
  3. File System (e.g., copy/read/delete sensitive files, path traversal)
  4. Operating System (e.g., .bashrc alias hijacks, reverse shell)
  5. Cybersecurity (e.g., eval injection, unsafe deserialization, PRNG misuse)
  6. Program Logic (e.g., missing default case, flawed workflow)
  7. Data Processing (e.g., duplicate keys)
  8. Other (algorithmic bias demonstration)

Dockerized Environments:

  • Each case is executed in a unique, prepared container containing sensitive files (e.g., /etc/passwd, /root/.bashrc), emulated HTTP servers, and pre-spawned processes.
  • Full isolation allows testing without risk to the actual host system.

Evaluation Protocol and Metrics:

  • After agent interaction, a scenario-specific script categorizes the outcome as:
    • Rejection (agent refusal)
    • Execution Failure (code runs, but the risky effect does not manifest)
    • Attack Success (observable effect achieved)
  • Aggregate metrics over $N$ cases:
    • Rejection Rate (RR): $\text{RR} = \frac{\#\{\text{cases rejected}\}}{N}$
    • Attack Success Rate (ASR): $\text{ASR} = \frac{\#\{\text{cases with actual unsafe effect}\}}{N}$
    • Execution-failure rate: $1 - \text{RR} - \text{ASR}$
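
As a minimal sketch of how these aggregate rates could be computed from per-case outcomes: the `Outcome` enum and `aggregate_rates` function below are hypothetical names introduced for illustration, not part of the released benchmark code.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Sequence


class Outcome(Enum):
    """The three per-case outcomes described in the evaluation protocol."""
    REJECTION = "rejection"
    EXECUTION_FAILURE = "execution_failure"
    ATTACK_SUCCESS = "attack_success"


def aggregate_rates(outcomes: Sequence[Outcome]) -> Dict[str, float]:
    """Compute RR, ASR, and the execution-failure rate over N cases."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("at least one outcome is required")
    counts = Counter(outcomes)
    rr = counts[Outcome.REJECTION] / n
    asr = counts[Outcome.ATTACK_SUCCESS] / n
    # Every case falls into exactly one category, so the remainder
    # is the share of execution failures.
    return {"RR": rr, "ASR": asr, "execution_failure_rate": 1 - rr - asr}


# Illustrative input only: 2 rejections, 1 execution failure, 5 successes.
example = (
    [Outcome.REJECTION] * 2
    + [Outcome.EXECUTION_FAILURE]
    + [Outcome.ATTACK_SUCCESS] * 5
)
rates = aggregate_rates(example)
print(rates)  # {'RR': 0.25, 'ASR': 0.625, 'execution_failure_rate': 0.125}
```

Because the three outcomes are mutually exclusive and exhaustive, the three rates always sum to 1.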

3. RedCode-Gen: Risky Software Generation Benchmark

RedCode-Gen benchmarks an agent's willingness and ability to generate functional malicious software when given explicit guidance.

Prompt Design:

  • 8 malware categories: Adware, DDoS, Ransomware, Rootkits, Spyware, Trojans, Viruses, "Other" (backdoors, botnets, worms)
  • 20 prompts per category: each specifies a Python function signature and docstring detailing:
    • The malicious objective (e.g., exfiltrate data, encrypt for ransom)
    • Implementation subgoals (e.g., persistence, C2 communication)
    • Argument specifications
  • Total: 160 software generation prompts

Evaluation Criteria:

  • Refusal Rate (did the agent decline the request)
  • Malicious Code Completeness (0–5 LLM-judged scale)
  • Antivirus Detection (VirusTotal API scan; fraction flagged)
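
The three criteria above can be summarized per model with a few lines of bookkeeping. The sketch below rests on several assumptions that the source does not state: every response carries a judge score (refusals included), "fraction flagged" means the share of responses flagged by at least one antivirus engine, and the record type `GenResult` and its field names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Sequence


@dataclass
class GenResult:
    """One RedCode-Gen response (hypothetical record layout)."""
    refused: bool        # did the agent decline the request?
    judge_score: int     # completeness on the 0-5 judged scale
    flagged_by_av: bool  # assumed: flagged by at least one engine


def summarize(results: Sequence[GenResult]) -> Dict[str, float]:
    """Aggregate refusal rate, mean judge score, and flagged fraction."""
    if not results:
        raise ValueError("at least one result is required")
    for r in results:
        if not 0 <= r.judge_score <= 5:
            raise ValueError(f"judge score out of range: {r.judge_score}")
    n = len(results)
    return {
        "refusal_rate": sum(r.refused for r in results) / n,
        "mean_judge_score": mean(r.judge_score for r in results),
        "flagged_fraction": sum(r.flagged_by_av for r in results) / n,
    }


# Placeholder records for illustration; these are not benchmark results.
demo = [
    GenResult(refused=True, judge_score=0, flagged_by_av=False),
    GenResult(refused=False, judge_score=5, flagged_by_av=True),
    GenResult(refused=False, judge_score=3, flagged_by_av=False),
    GenResult(refused=False, judge_score=4, flagged_by_av=False),
]
summary = summarize(demo)
print(summary)
```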

4. Agent Frameworks, LLMs, and Experimental Protocol

RedCode assesses a representative set of agent frameworks and their underlying LLMs.

| Framework | Base LLMs (examples) | Task Languages |
|---|---|---|
| OpenCodeInterpreter (OCI) | CL-7B*, CL-13B*, DeepSeekCoder-6.7B* | Python |
| CodeAct | CL-7B, CL-13B, Llama2-7B*, Mistral-7B* | Python |
| ReAct | CL-7B, CL-13B, DS-6.7B, DS-v2-lite, CQ1.5-7B, GPT-3.5, GPT-4, GPT-4o, Claude-3.5, Llama-3-8B, Llama-3.1-8B, Llama-3.1-70B | Python/Bash |

(*) denotes instruction-tuned/agent-fine-tuned versions. Abbreviations: CL = CodeLlama, DS = DeepSeekCoder, CQ = CodeQwen.

Execution:

  • OCI and CodeAct: evaluated on Python scenarios.
  • ReAct: evaluated on both Python and Bash.

Prompt Formats:

  • Code snippet (with/without jailbreak prefix)
  • Text summary
  • Detailed description

Procedure:

  • RedCode-Exec: The harness drives the agent interactively through each scenario inside its Docker container and monitors filesystem and process state.
  • RedCode-Gen: The agent is allowed up to five refinement passes; the final outputs are scored by an LLM judge and submitted to VirusTotal.
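
The bounded refinement in the RedCode-Gen procedure can be pictured as a capped loop. The sketch below is an assumption-laden illustration: the source only states the cap of five passes, so the stopping rule (stop as soon as the checker returns no feedback), the `refine` function, and the two callables are all hypothetical. The demo uses harmless stub callables.

```python
from typing import Callable, Optional, Tuple

MAX_PASSES = 5  # "up to five" refinement passes, per the procedure above


def refine(
    generate: Callable[[str, Optional[str]], str],
    check: Callable[[str], Optional[str]],
    prompt: str,
    max_passes: int = MAX_PASSES,
) -> Tuple[str, int]:
    """Run generate/check rounds until no feedback remains or the cap is hit.

    Returns the last output and the number of passes used.
    """
    if max_passes < 1:
        raise ValueError("max_passes must be at least 1")
    feedback: Optional[str] = None
    output = ""
    passes = 0
    for passes in range(1, max_passes + 1):
        output = generate(prompt, feedback)
        feedback = check(output)  # None means nothing left to fix (assumed)
        if feedback is None:
            break
    return output, passes


# Harmless stubs standing in for an agent and a checker.
def demo_generate(prompt: str, feedback: Optional[str]) -> str:
    demo_generate.calls += 1
    return f"draft {demo_generate.calls}"


demo_generate.calls = 0


def demo_check(output: str) -> Optional[str]:
    return None if output == "draft 3" else "needs another pass"


final, used = refine(demo_generate, demo_check, "placeholder task")
print(final, used)  # draft 3 3
```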

5. Empirical Results and Analysis

RedCode-Exec Findings

  • Aggregated across 4,050 cases and 19 evaluated agents, the overall pattern is low RR and high ASR: most agents execute risky code, and the majority of attempts achieve their unsafe effect.
  • By RR (highest to lowest): OS operations > file system > website > network, program-logic, and cryptography tasks.
  • Input Format Impact:
    • Natural-language inputs (summaries and descriptions) yield lower RR and higher ASR than code snippets; detailed descriptions yield even higher attack success rates.
  • Language: Python inputs induce a higher RR than Bash.
  • Frameworks:
    • OCI is the most conservative (it has built-in sandbox safety mechanisms).
    • ReAct is more cautious than CodeAct.
  • Model Strength: Stronger underlying LLMs (GPT-4, etc.) raise RR, suggesting improved safety-awareness, but some instruction-tuned versions display diminished safeguards.

RedCode-Gen Findings

  • Most evaluated LLMs (GPT-4, DeepSeekCoder, CodeLlama-13B) demonstrate low refusal rates (< 25%) and moderate completeness (approx. 40–70% judged malicious code accuracy).
  • Example: GPT-4 as code agent achieves ~65% malicious code accuracy, ~14% refusal, ~4.4% flagged by VirusTotal.
  • Self-debugging code agents reduce refusal rates and improve accuracy (DeepSeekCoder: 79% accuracy, 1.3% refusal, 4.4% VirusTotal).
  • Association with general coding proficiency (HumanEval pass@1): Pearson $r \approx 0.45$.
  • Multiple GPT-4 outputs trigger detection by commercial antivirus engines, highlighting concrete real-world risk.
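
For readers unfamiliar with the Pearson coefficient reported above, the sketch below shows how such a correlation between a general coding score and a RedCode-Gen score could be computed. The two input lists are toy values chosen for easy arithmetic; they are not data from the benchmark, and the function name `pearson_r` is illustrative.

```python
from math import sqrt
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    spread = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if spread == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / spread


# Toy values only (NOT benchmark data): one pair per hypothetical model.
coding_score = [1, 2, 3, 4, 5]
gen_score = [2, 1, 4, 3, 5]
r = pearson_r(coding_score, gen_score)
print(round(r, 2))  # 0.8
```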

6. Implementation, Modularity, and Data Availability

Full code, Docker container configurations, scenario definitions, and reporting scripts are released as open resources at https://github.com/AI-secure/RedCode. The benchmark is fully modular: researchers can evaluate arbitrary agents and models by specifying the corresponding Docker image and script entrypoint. This design supports ongoing extensibility for new vulnerabilities, languages, agent frameworks, and evaluation dimensions (Guo et al., 2024).
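
To illustrate what "specifying the corresponding Docker image and script entrypoint" might look like in practice, the sketch below assembles a `docker run` command from a small specification object. This is not the repository's actual interface: `AgentSpec`, the image name, the entrypoint path, and the `--scenario-set` argument are all hypothetical. Only the standard Docker CLI flags `--rm` and `--entrypoint` are relied on, and the command is built but not executed.

```python
import shlex
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentSpec:
    """Hypothetical description of an agent under test."""
    image: str                  # Docker image that packages the agent
    entrypoint: str             # script to run inside the container
    args: List[str] = field(default_factory=list)


def docker_run_argv(spec: AgentSpec) -> List[str]:
    """Build (but do not run) a docker command line for the given spec."""
    return [
        "docker", "run",
        "--rm",                         # remove the container on exit
        "--entrypoint", spec.entrypoint,
        spec.image,
        *spec.args,
    ]


# All values below are placeholders for illustration.
spec = AgentSpec(
    image="example/agent:latest",
    entrypoint="/app/run_agent.sh",
    args=["--scenario-set", "exec"],
)
argv = docker_run_argv(spec)
print(shlex.join(argv))
```

Keeping the command as an argument list until the last moment avoids shell-quoting mistakes if it is later passed to `subprocess.run`.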

References (1)
