InterCode-CTF Benchmark for Cybersecurity Agents
- InterCode-CTF is a standardized, containerized benchmark designed to evaluate and train language models in interactive, multi-step offensive security challenges.
- It comprises 91 challenges across cryptography, forensics, reverse engineering, and other CTF genres, calibrated to high-school level difficulty.
- Its Docker-based infrastructure and explicit evaluation protocols enable rigorous comparative analysis of agent performance and success metrics.
InterCode-CTF is a standardized, containerized benchmark for evaluating and training language model (LM) agents in end-to-end offensive security reasoning, specifically in Capture-The-Flag (CTF) cyber-challenge scenarios. Derived from picoCTF's high-school-level problems, it has become a pivotal suite for both academic and industrial research exploring interactive code execution, vulnerability discovery, and agent intelligence under real-world security constraints. InterCode-CTF's fully-automated infrastructure, reproducible task set, and explicit evaluation protocol enable rigorous comparative analysis of agent capabilities in multi-step exploitation and reverse engineering contexts (Yang et al., 2023, Turtayev et al., 2024).
1. Benchmark Composition and Task Taxonomy
InterCode-CTF comprises 91 containerized challenges spanning the main CTF genres:
| Category | #Tasks | Representative Content |
|---|---|---|
| Cryptography | 16 | Classic ciphers, modular arithmetic, RSA |
| Forensics | 13 | Packet analysis, steganography, file carving |
| Pwn (Memory Corruption) | 2 | Buffer overflows, format-string issues |
| Reverse Engineering | 27 | Binary disassembly, unpacking, dynamic analysis |
| Web | 2 | SQL injection, template injection |
| Miscellaneous | 31 | Logic, DevTools, general puzzles |
Every challenge mandates interactive exploration—typically decompile→patch, debug→fuzz, or multi-stage CLI+Python workflows—for runtime-based flag discovery rather than static code generation (Yang et al., 2023, Zhuo et al., 25 Aug 2025). Task difficulty is calibrated at the “High School” level by picoCTF’s rubric; empirical solution lengths peak at 5–15 agent turns (with a heavy tail up to 40).
2. Environment and Execution Framework
InterCode-CTF leverages isolated Docker-based containers, offering a reproducible and secure sandbox for each instance. Executed via orchestration tools such as EnIGMA+ or CTF-Forge, each container exposes:
- Ubuntu 20.04 (32/64-bit) base with statically configured toolchains (GCC, Python 3.8+, binutils, gdb, netutils, web servers).
- Specialized services: Apache/PHP for web, socat+binary for pwn, Python servers for crypto; each orchestrated via docker-compose.yml and parameterized by challenge.json metadata for ports and flag paths.
- Access Control: No privileged mode; resource isolation (CPU shares, RAM caps); all networked on a “ctfnet” bridge.
- Agent Interface: Bash-like REPL plus custom agent tools (`decompile`, `disassemble`, `debug_*`, `connect_start`, etc.), allowing up to 40 turns per session, with the episode ending on correct flag submission via `submit '<flag>'` (Zhuo et al., 25 Aug 2025, Abramovich et al., 2024).
This infrastructure supports automated reset and evaluation cycles, crucial for benchmarking stochastic machine learning agents.
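The per-challenge metadata described above can be pictured as a small JSON descriptor consumed by the orchestrator. The field names below are illustrative assumptions for exposition, not the benchmark's actual challenge.json schema:

```json
{
  "name": "example-rev-challenge",
  "category": "Reverse Engineering",
  "ports": [1337],
  "flag_path": "/ctf/flag.txt",
  "compose_file": "docker-compose.yml"
}
```

In this style of setup, the orchestrator reads such a descriptor to wire container ports onto the "ctfnet" bridge and to locate the hidden flag for automatic grading.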
3. Evaluation Protocols and Success Metrics
The benchmark formalizes evaluation as an episodic partially observable Markov decision process (POMDP):
- Agent receives only the CTF prompt and may issue arbitrary shell or Python commands.
- At each step, the agent observes the output of its command, with the internal container state non-observable except via such interaction.
- The only reward is upon successful submission of the exact hidden flag.
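The episodic loop above can be sketched as follows. This is a minimal illustration of the POMDP framing, assuming a hypothetical `agent` callable; a plain `subprocess` call stands in for the actual Docker sandbox, and the `submit` convention mirrors the interface described in Section 2:

```python
import subprocess

MAX_TURNS = 40  # per-session turn cap used by the benchmark

def run_episode(agent, prompt, true_flag):
    """One episode: the agent observes only command outputs (partial
    observability); the sole reward is an exact-match flag submission."""
    observation = prompt
    for _ in range(MAX_TURNS):
        action = agent(observation)  # a shell/Python command or a submission
        if action.startswith("submit "):
            guess = action[len("submit "):].strip("'\"")
            return guess == true_flag  # sparse terminal reward
        # Execute in the sandbox; here a subprocess stands in for the container.
        result = subprocess.run(action, shell=True, capture_output=True,
                                text=True, timeout=30)
        observation = result.stdout + result.stderr
    return False  # turn budget exhausted without a correct flag
```

The container's internal state never appears directly; the agent must infer it from successive observations, which is what makes the process a POMDP rather than a fully observed MDP.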
The following metrics are standardized:
- Pass@k: For n independent agent runs per task, pass@k = 1 − C(n − c, k) / C(n, k), where n is the total number of rollouts and c is the number of correct flag submissions.
- SuccessRate (Pass@1): Proportion of tasks solved at least once by a single sample, i.e., pass@k evaluated at k = 1, which reduces to c / n per task.
Additional evaluation includes tracking stuck-in-loop rates (repetitive agent actions), average step length, and time-to-solve where appropriate (Zhuo et al., 29 Jul 2025, Zhuo et al., 25 Aug 2025, Turtayev et al., 2024, Yang et al., 2023).
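The pass@k metric is commonly computed with the unbiased combinatorial estimator (the same form popularized for code benchmarks); a minimal sketch, assuming n rollouts of which c succeeded:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n rollouts (c correct) solves the task."""
    if n - c < k:
        return 1.0  # fewer failures than draws: some draw must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 rollouts and 1 success, `pass_at_k(2, 1, 1)` gives 0.5, and pass@1 reduces to the per-task success fraction c / n as expected.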
4. Agent Architectures and Methodologies
Earlier iterations utilized static prompt-to-code or single-shot code generation, yielding near-zero solve rates. All modern approaches employ deeply interactive paradigms:
- ReAct Loop: Agents execute chain-of-thought reasoning—separating “thought” and “action”—with each action yielding new observations (Abramovich et al., 2024, Turtayev et al., 2024).
- Interactive Agent Tools (IATs): Tool interfaces (debug_start, connect_sendline, create/edit for scripts) enable scripted or multi-modal exploitation without blocking the main shell. For example, gdb debugging or specialized server connections are invoked as non-blocking submodules (Abramovich et al., 2024).
- Action–Observation History and Planning: Maintaining a moving window of past actions/observations enhances long-horizon performance; explicit re-planning injections (at fixed turn intervals) unlock classes of tasks previously unsolved by pure reactive modes (Turtayev et al., 2024).
- Runtime-Free Trajectory Synthesis: Recent systems (e.g., Cyber-Zero) employ dual-persona LLMs to synthesize multi-turn agent-environment dialogues from public CTF writeups, bypassing the need for ephemeral or restricted runtime environments (Zhuo et al., 29 Jul 2025).
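The ReAct loop with periodic re-planning injections can be sketched as below. The `llm` and `execute` callables are hypothetical stand-ins for the model interface and the sandbox, and `plan_every` is an assumed name for the fixed re-planning interval:

```python
def react_and_plan(llm, task_prompt, execute, max_turns=40, plan_every=10):
    """Minimal ReAct-style loop: alternate thought/action steps, with an
    explicit planning step injected at fixed turn intervals."""
    history = [("task", task_prompt)]
    for turn in range(1, max_turns + 1):
        if turn % plan_every == 0:
            # Re-planning injection: ask the model to summarize and re-plan.
            history.append(("plan", llm(history, mode="plan")))
        thought, action = llm(history, mode="act")  # separate thought/action
        history.append(("thought", thought))
        if action.startswith("submit "):
            return action  # terminal flag submission
        observation = execute(action)  # each action yields a new observation
        history.append(("action", action))
        history.append(("observation", observation))
    return None  # turn budget exhausted
```

Keeping the full action-observation history in `history` is what enables the long-horizon behavior noted above; truncating it to a short window is a common ablation.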
A representative workflow for a reverse engineering challenge includes discovering and inspecting executables, decompilation/disassembly, interactive debugging, dynamic input fuzzing, scripting decoders, and flag submission.
5. Quantitative Results and Comparative Performance
InterCode-CTF has catalyzed rapid progress in LLM-driven CTF solvers:
| Approach / Agent | Tasks (#) | Pass@1 / SuccessRate | Multi-Attempt Pass@k | Citation |
|---|---|---|---|---|
| Original InterCode (Yang) | 100 | 40% | — | (Yang et al., 2023) |
| DeepMind Agent (Phuong) | 81 | 29% | — | (Turtayev et al., 2024) |
| EnIGMA (Abramovich) | 100 | 72% (GPT-4 Turbo) | — | (Abramovich et al., 2024) |
| Turtayev et al. (ReAct&Plan) | 85 | 89% (@1), 95% (@5) | Pass@5 = 95% | (Turtayev et al., 2024) |
| CTF-Dojo-32B | 91 | 83.5% | 91.0% (estimated) | (Zhuo et al., 25 Aug 2025) |
| Cyber-Zero-32B | 91 | 82.4% | — | (Zhuo et al., 29 Jul 2025) |
| KryptoPilot (crypto only) | 18 (crypto) | 100% | — | (Liu et al., 14 Jan 2026) |
Empirical studies confirm that baseline prompt engineering (static agents) yields ≤40% pass rates, while judicious tool integration, chain-of-thought prompting, multi-attempt protocols, and execution-grounded training result in pass rates exceeding 90% on the canonical suite. Ablations show 2–3 percentage point drops when removing agent tools or output summarization (Turtayev et al., 2024, Zhuo et al., 25 Aug 2025).
Near-saturation is demonstrated by Turtayev et al., whose ReAct&Plan agent achieves 95% pass@5, with ablations underscoring the necessity of multi-attempts, full action-observation memory, expanded CLI/Python tooling, and structured output formatting (Turtayev et al., 2024). CTF-Dojo and Cyber-Zero highlight the impact of trajectory data collection and execution-grounded feedback loops for open-weight models (Zhuo et al., 25 Aug 2025, Zhuo et al., 29 Jul 2025).
6. Task-Level Insights, Failure Modes, and Research Directions
Representative InterCode-CTF exploit strategies include:
- Multi-step Cryptanalysis: Agents orchestrate Python snippets for modular inversion, factorization, and plaintext recovery (e.g., reconstructing RSA private keys followed by decryption) (Zhuo et al., 25 Aug 2025, Zhuo et al., 29 Jul 2025).
- Reverse Engineering: Sequences integrate decompilation, dynamic breakpoints, memory inspection, and input fuzzing in tightly-coupled agent reasoning cycles (Abramovich et al., 2024).
- Web and Pwn: Intrinsic need for command orchestration across shell, debugger, and server interfaces necessitates robust tool support (e.g., patching templates, rebooting servers, or chaining exploits via multiple commands).
- Forensics: Workflow illustrates the necessity of domain-specific utilities (e.g., Sleuthkit’s fls/icat), with agents required to install and invoke packages within the container (Yang et al., 2023).
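The multi-step cryptanalysis pattern from the first bullet can be sketched end to end. The parameters below are small textbook values chosen for illustration, not taken from an actual challenge; the agent's role would be to recover the factors p and q before this point:

```python
# RSA private-key reconstruction and decryption, post-factorization.
p, q, e = 61, 53, 17
n = p * q                 # public modulus (factored earlier by the agent)
phi = (p - 1) * (q - 1)   # Euler's totient of n
d = pow(e, -1, phi)       # modular inverse of e (Python 3.8+ three-arg pow)
m = 42                    # plaintext the challenge hides
c = pow(m, e, n)          # ciphertext as recovered from the challenge
recovered = pow(c, d, n)  # private-key decryption
assert recovered == m
```

In an actual run the agent would script exactly this kind of snippet inside the container after extracting n, e, and c from the challenge files, which is why Python 3.8+ is part of the standard toolchain.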
Common failure modes include cost-budget exhaustion, irrecoverable dead-ends under limited re-planning, toolchain installation failures, and rare contamination/flag-memorization episodes from training data (Abramovich et al., 2024, Turtayev et al., 2024). For higher-difficulty cryptography, KryptoPilot demonstrates the necessity of external open-world retrieval for precise attack modeling; coarse-grained or "just-in-case" knowledge is insufficient (Liu et al., 14 Jan 2026).
Recommendations focus on automated challenge ingestion from live CTFs, incorporating partial-flag/coverage-based RL signals, expanding to professional and kernel-level exploitation, and developing robust data distillation techniques from public writeups without flag leakage (Zhuo et al., 25 Aug 2025).
7. Historical Evolution and Broader Context
InterCode-CTF evolved from the InterCode framework for interactive code generation, extending the core POMDP formalism for multi-step code–execution tasks within reproducible Docker environments (Yang et al., 2023). Early agent approaches failed on CTF, highlighting the interaction gap in code benchmarks compared to real-world adversarial reasoning. Since then, a methodological shift from one-shot to deeply interactive, tool-augmented, and trajectory-intensive paradigms has driven state-of-the-art advances. InterCode-CTF now underpins several lines of research in agent alignment, RLHF for code, security evaluation, and synthetic data generation, and continues to serve as both a proving ground and diagnostic suite for scalable machine reasoning in security contexts.