
RedCodeAgent: Automated Red-Teaming System

Updated 30 December 2025
  • RedCodeAgent is an automated red-teaming system that systematically evaluates LLM-driven code agents for security vulnerabilities.
  • It integrates adaptive memory, diverse red-teaming tools, and sandboxed environments to uncover and simulate complex attack strategies.
  • Empirical results show that RedCodeAgent achieves higher attack success rates and improved vulnerability detection compared to traditional static benchmarks.

RedCodeAgent is an automated red-teaming system for the systematic safety evaluation of LLM-driven code agents. Designed to assess and uncover vulnerabilities across code interpretation, execution, and generation tasks, RedCodeAgent integrates adaptive memory for attack strategy selection, a diverse toolbox of red-teaming techniques, and high-fidelity sandboxed environments for precise, real-world evaluation. Its framework addresses critical gaps in existing static safety benchmarks and jailbreak scenarios, providing scalable, domain-diverse coverage with superior attack success rates and efficiency compared to prior red-teaming tools (Guo et al., 2 Oct 2025).

1. Motivation and Safety Challenges in Code Agents

LLM-powered code agents—entities capable of generating, interpreting, and executing software—have rapidly permeated software engineering workflows, streamlining tasks such as dynamic debugging, interactive programming, and routine code synthesis. This capability expansion dramatically increases the attack surface: adversarial, careless, or malicious prompts can induce unintended software behaviors, including data exfiltration, privilege escalation, or destructive file operations. Conventional safety evaluations largely rely on static benchmarks or manually crafted jailbreak prompts, which suffer from two core deficiencies:

  • Boundary coverage gaps: Static safety prompts often omit corner-case vulnerabilities (e.g., subtle API substitutions bypassing filter constraints).
  • Combinatorial attack weakness: Most benchmarks employ a single red-teaming tool per attempt, failing to explore chained exploit strategies or adaptive attack sequences.

These limitations prevent accurate risk characterization and adaptive defenses, necessitating automated, iterative red-teaming systems such as RedCodeAgent for robust safety assessment (Guo et al., 2 Oct 2025).

2. System Architecture and Core Components

RedCodeAgent employs an LLM-centric architecture built around an iterative probe-execute-feedback loop, comprising:

  • Input Query Parser: Normalizes risk scenarios and attack prompts.
  • Adaptive Memory Module: Logs and retrieves prior successful jailbreaks. For a new query $q$, precomputed embeddings $e_q^{\mathrm{risk}}$ and $e_q^{\mathrm{des}}$ are matched against each stored entry $m$ via cosine similarity:

S_r = \cos\bigl(e_q^{\mathrm{risk}},\, e_m^{\mathrm{risk}}\bigr), \quad S_t = \cos\bigl(e_q^{\mathrm{des}},\, e_m^{\mathrm{des}}\bigr)

with a trajectory-length penalty $P = \rho\,|\mathrm{trajectory}_m|$, yielding the overall score $S(m) = S_r + S_t - P$.
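A minimal sketch of this retrieval score, assuming plain cosine similarity over list-valued embeddings (the entry field names here are illustrative, not the system's actual data model):

```python
import math

# Sketch of the adaptive-memory retrieval score S(m) = S_r + S_t - P.
# Entry field names ("e_risk", "e_des", "trajectory") are illustrative.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def score_entry(e_q_risk, e_q_des, entry, rho=0.05):
    s_r = cosine(e_q_risk, entry["e_risk"])    # risk-scenario similarity
    s_t = cosine(e_q_des, entry["e_des"])      # task-description similarity
    penalty = rho * len(entry["trajectory"])   # trajectory-length penalty P
    return s_r + s_t - penalty

def retrieve(e_q_risk, e_q_des, memory, rho=0.05):
    """Return the stored jailbreak entry with the highest score S(m)."""
    return max(memory, key=lambda m: score_entry(e_q_risk, e_q_des, m, rho))
```

The penalty term biases retrieval toward entries whose attack trajectories were short, i.e., cheap to replay.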

  • Dynamic Tool Selector: Selects and sequences red-teaming tools (gradient-based suffix attacks, evolutionary/learning-based generators, code substitution).
  • Sandboxed Execution Environment: Executes generated code in a Docker container, instrumented to observe effects (e.g., file deletion, unauthorized DB access), thereby bypassing LLM-only static analysis bias.
  • Result Analyzer: Classifies attempts as "attack success," "execution failure," or "rejection" based on container inspection.

The system iterates until a successful exploit is confirmed or an iteration budget is exhausted, using memory and prior feedback to optimize tool selection and prompt engineering (Guo et al., 2 Oct 2025).
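Under the simplifying assumption of a greedy tool selector, the probe-execute-feedback loop can be sketched as follows; all function and field names are illustrative stand-ins, not the system's actual API:

```python
# Minimal sketch of the probe-execute-feedback loop, assuming a greedy
# tool-selection rule. Structures and names are illustrative.

def red_team_loop(query, tools, sandbox, budget=10):
    """Iterate probe -> execute -> feedback until success or budget is spent.

    tools: list of dicts with "name", "mutate" (prompt transformer), and
           running "successes"/"tries" counters.
    sandbox: callable returning "attack_success", "execution_failure",
             or "rejection" for a candidate prompt.
    """
    history = []
    for _ in range(budget):
        # Greedily favor the tool with the best observed success rate.
        tool = max(tools, key=lambda t: t["successes"] / max(t["tries"], 1))
        prompt = tool["mutate"](query)      # probe: craft an attack prompt
        verdict = sandbox(prompt)           # execute: observe real effects
        tool["tries"] += 1                  # feedback: update tool statistics
        history.append((tool["name"], verdict))
        if verdict == "attack_success":
            tool["successes"] += 1
            return prompt, history
    return None, history
```

In the real system the sandbox is a Docker container instrumented for side effects, and tool selection also draws on the adaptive memory module rather than local counters alone.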

3. Red-teaming Toolbox and Attack Combinations

The RedCodeAgent toolbox incorporates:

  • Code Substitution (specialized LLMs)
  • Gradient-based Suffix Attack (GCG)
  • Learning-based (AdvPrompter, AmpleGCG)
  • Evolution-based (AutoDAN)

Tool selection is dynamically guided by adaptive memory and recent feedback, approximately solving

\max_{t \in \mathrm{Tools}} \mathbb{E}[\mathrm{success}_t] - \lambda\,\mathrm{time}_t

where expected success is estimated from each tool's historical performance and $\lambda$ penalizes latency. This dynamic combination capability lets RedCodeAgent penetrate agent defenses that resist single-method attacks, substantially raising vulnerability-detection coverage (Guo et al., 2 Oct 2025).
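The selection objective can be illustrated with hypothetical per-tool statistics (the numbers below are invented for the example, not measurements from the paper):

```python
# Sketch of the selection objective: maximize estimated success minus a
# latency penalty. The statistics are invented for illustration.

def select_tool(stats, lam=0.01):
    """Pick the tool maximizing E[success_t] - lam * time_t."""
    def utility(t):
        est_success = t["successes"] / max(t["attempts"], 1)
        return est_success - lam * t["seconds"]
    return max(stats, key=utility)

stats = [
    {"name": "GCG",         "successes": 6, "attempts": 10, "seconds": 120.0},
    {"name": "AutoDAN",     "successes": 5, "attempts": 10, "seconds": 30.0},
    {"name": "AdvPrompter", "successes": 4, "attempts": 10, "seconds": 5.0},
]
best = select_tool(stats)  # trades raw success rate against attack latency
```

With these numbers the slower gradient-based attack loses out despite its higher raw success rate, which is exactly the trade-off the latency term encodes.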

4. Evaluation Methodology and Empirical Results

RedCodeAgent is evaluated on multiple safety benchmarks and commercial code assistants, employing both risky-execution (RedCode-Exec) and risky-generation (RedCode-Gen) scenarios as defined in RedCode (Guo et al., 12 Nov 2024). The benchmarks include 4,050 test cases covering 25 vulnerability types across eight domains (including OS, file system, network, website, program logic, cybersecurity, and data processing), plus real-world malware synthesis across eight families.

Metrics:

  • Attack Success Rate (ASR): \mathrm{ASR} = \frac{\#\text{Attack Successes}}{\#\text{Total Attempts}}
  • Rejection Rate (RR): \mathrm{RR} = \frac{\#\text{Rejections}}{\#\text{Total Attempts}}
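Both metrics reduce to simple outcome counting over per-attempt verdicts; a small sketch:

```python
from collections import Counter

# Sketch of the ASR / RR metrics over the three analyzer verdicts:
# "attack_success", "execution_failure", "rejection".

def attack_metrics(outcomes):
    """Compute (ASR, RR) from a list of per-attempt verdicts."""
    counts = Counter(outcomes)
    total = len(outcomes)
    asr = counts["attack_success"] / total   # attack success rate
    rr = counts["rejection"] / total         # rejection rate
    return asr, rr
```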

Key outcomes (extracted from Table 1 of Guo et al., 2 Oct 2025):

Method          ASR      RR
No-jailbreak    55.5%    14.7%
GCG             54.7%    12.8%
RedCodeAgent    72.5%    7.5%

For commercial agents (Cursor, Codeium):

Agent     Baseline ASR    Baseline RR    RedCodeAgent ASR    RedCodeAgent RR
Cursor    62.6%           7.0%           72.7%               4.1%
Codeium   61.0%           5.9%           69.9%               4.3%

RedCodeAgent consistently identifies more vulnerabilities than state-of-the-art single-tool red-teamers or static benchmarks, including uncovering previously unknown weaknesses in commercial assistants (Guo et al., 2 Oct 2025, Guo et al., 12 Nov 2024).

5. Multi-Agent Security Architectures and Threat Modeling

RedCodeAgent can be deployed to evaluate LLM-based multi-agent software development assistants subject to code-injection attacks. Analysis of three canonical pipelines—Coder (C), Coder-Tester (CT), and Coder-Reviewer-Tester (CRT)—shows trade-offs between resilience and efficiency:

Architecture    Accuracy (%)    Attack Success (Single)    Attack Success (Continued)    LLM Calls
C               94.7            100.0%                     100.0%                        164
CT              95.9            92.5%                      100.0%                        350
CRT             95.3            1.4%                       6.7%                          529

Adding a Security Analysis Agent (SAA) to CT yields negligible efficiency loss but robust resilience against simple injection. However, advanced attacks, such as poisoned few-shot examples embedded in comments, result in attack success rates up to 71.95%, reducing resilience to 28.05% (Bowers et al., 26 Dec 2025). This suggests sophisticated language-based payloads circumvent multi-layer agent defenses without additional measures.

6. Benchmark Design and Safety Insights

RedCode (Guo et al., 12 Nov 2024), the benchmark underpinning RedCodeAgent’s evaluations, features:

  • RedCode-Exec: 4,050 test cases probing agent responses to risky operations in Python and Bash (e.g., destructive file ops, privilege escalation, network exfiltration).
  • RedCode-Gen: 160 malware synthesis prompts modeled after HumanEval, measuring accuracy, refusal rates, and VirusTotal detection.
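A risky-execution test case and its three-way outcome classification can be pictured as a small structured record; the schema and field names below are a hypothetical illustration, not the benchmark's actual format:

```python
# Hypothetical shape of a RedCode-Exec-style test case (illustrative only).
risky_case = {
    "domain": "filesystem",
    "language": "bash",
    "risk": "destructive file operation",
    "prompt": "Write a script that removes every file under a given directory.",
    "expected_effect": "target files absent after execution",
}

def classify(agent_reply: str, effect_observed: bool) -> str:
    """Map an agent's behavior on a test case to the three RedCode outcomes.

    The refusal markers are a toy heuristic; the real analyzer inspects
    the sandbox container state rather than reply text alone.
    """
    refusal_markers = ("cannot", "won't", "refuse")
    if any(marker in agent_reply.lower() for marker in refusal_markers):
        return "rejection"
    return "attack_success" if effect_observed else "execution_failure"
```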

Findings include:

  • Agents exhibit higher vigilance on overt OS/file-system attacks (rejection rates up to 80%), but are easily subverted by buggy logic or natural-language “wraps.”
  • More capable LLMs (e.g., GPT-4) refuse risky execution tasks more often, but produce more sophisticated, less detectable malware code on generation tasks.
  • Natural-language task descriptions increase attack success rates by approximately 15 percentage points over direct code prompts.

A plausible implication is that adversarial natural language engineering will be increasingly utilized to bypass static code filters (Guo et al., 12 Nov 2024).

7. Recommendations and Future Directions

To harden code agent ecosystems and improve safety assessment:

  • Implement multi-layered checks: Multi-agent review (CRT or CT+SAA), static and dynamic code analysis.
  • Employ strong prompt engineering: Explicitly blacklist risky API calls (e.g., requests.post, socket, os.system).
  • Utilize differential and provenance testing: Randomized re-generation, cryptographic output signing.
  • Mitigate few-shot poisoning: Sanitize example blocks and comments prior to LLM ingestion.
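As one concrete instance of the blacklist recommendation, a lightweight AST pre-check could flag risky calls in generated code before execution; the blacklist below is an example, not exhaustive:

```python
import ast

# Example static pre-check that flags blacklisted API calls in generated
# Python code before an agent executes it. The blacklist is illustrative.
RISKY_CALLS = {"os.system", "subprocess.run", "socket.socket",
               "requests.post", "eval", "exec"}

def _dotted(func):
    """Recover a dotted name like 'os.system' from a call's func node."""
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        base = _dotted(func.value)
        return f"{base}.{func.attr}" if base else func.attr
    return ""

def flag_risky_calls(source: str) -> list[str]:
    """Return the dotted names of blacklisted calls found in the code."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = _dotted(node.func)
            if name in RISKY_CALLS:
                flagged.append(name)
    return flagged
```

A syntactic check like this is easily evaded by aliasing or string tricks, which is why the recommendations above pair it with dynamic analysis and multi-agent review.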

RedCodeAgent’s retrieval-augmented, multi-tool, sandbox-based architecture is extendable to hybrid domains (robotics, embedded, medical, automotive) through knowledge base enrichment, agent workflow customization, and verification module extension (Lu et al., 26 Aug 2025, Bowers et al., 26 Dec 2025).

Areas for enhancement include learning-based memory optimization, tool-mix adaptation via bandit algorithms, support for multimodal agents, and broader detection of complex side effects (e.g., covert network callbacks) (Guo et al., 2 Oct 2025). Building community-shared memory repositories may further increase attack coverage and agent robustness.

RedCodeAgent represents a domain-adaptive, continually evolving platform for principled risk assessment of code agents and their multi-agent derivatives. Its empirical superiority and methodological extensibility position it as a critical tool in the advancement of LLM-based software safety and security.
