InterCode-CTF Benchmark

Updated 19 November 2025
  • InterCode-CTF is a container-based cybersecurity benchmark that standardizes interactive Capture-The-Flag challenges for evaluating automated agents.
  • It models each challenge as a partially observable Markov Decision Process, defining clear state, action, and observation spaces with fixed turn limits.
  • The benchmark drives advances in agent design by assessing strategies like zero-shot prompting and iterative planning via metrics such as Pass@1.

InterCode-CTF is a widely adopted benchmark designed to evaluate the capability of automated agents and LLMs to solve multi-step, interactive Capture-The-Flag (CTF) cybersecurity challenges within a reproducible, execution-driven environment. Derived from picoCTF—a high-school and undergraduate-level CTF competition—InterCode-CTF offers a standardized, container-based suite of tasks, enabling rigorous assessment of agent planning, tool use, vulnerability exploitation, and iterative problem solving. The benchmark has become a de facto yardstick for offensive security agents and is integral to numerous recent advances in agent-driven code intelligence and security research.

1. Benchmark Composition and Task Structure

InterCode-CTF consists of a curated set of CTF challenges sourced from the picoCTF archive, with the selection designed to balance category coverage and reproducibility. The canonical instantiations comprise between 85 and 100 tasks after filtering out instances that require vision or external internet access, or that ship with broken containers (Turtayev et al., 2024).

Category distribution across the main variants:

| Category | # Tasks (typical) | Description |
|---|---|---|
| Crypto | 16–19 | Cryptography, including RSA, group theory |
| Forensics | 13–15 | File carving, steganography, network analysis |
| Binary Exploitation | 2–4 | "Pwn" tasks: buffer overflows, stack exploits |
| Reverse Engineering | 27 | ELF analysis, static/dynamic inspection |
| Web Exploitation | 2 | Simple webserver flaws, network endpoints |
| Miscellaneous | 31–33 | Scripting, general skills, logic puzzles |

Each challenge is packaged as an isolated Docker container, including the following components (Yang et al., 2023, Abramovich et al., 2024, Zhuo et al., 25 Aug 2025):

  • Problem Statement: Natural-language instructions describing the task and desired goal (typically “find and submit the flag”).
  • Artifacts: Binaries, scripts, data files (images, PCAPs), or remote access endpoints.
  • Execution Environment: Pre-installed Linux utilities (e.g., gdb, binwalk, tshark), scripting languages, and occasionally category-specific tools (e.g., RsaCtfTool for Crypto).
  • Flag: A hidden “golden” string; static or dynamically generated, submission required for task completion.
  • Interaction Scaffold: Agents interact through a controlled action interface (bash/Python shell) using pre-defined commands (e.g., ls, decompile, debug_start, connect_start), and submit the extracted flag for verification.

Tasks are capped at a fixed number of interaction turns (commonly 30–40), and category assignments follow the original picoCTF taxonomy (Turtayev et al., 2024, Yang et al., 2023).
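
For concreteness, a single challenge can be thought of as a small metadata record paired with its container. The sketch below uses illustrative field names and values rather than the benchmark's exact on-disk schema.

```python
# Illustrative sketch of how one InterCode-CTF task might be described.
# Field names and values are assumptions for exposition, not the exact
# schema shipped with the benchmark.
example_task = {
    "task_id": 42,                          # index within the task suite
    "category": "Forensics",                # picoCTF taxonomy label
    "query": "Find the flag hidden in the provided image file.",
    "artifacts": ["challenge.jpg"],         # files mounted into the container
    "gold": "picoCTF{example_flag_value}",  # hidden golden flag (static here)
    "max_turns": 40,                        # interaction budget for the episode
    "tools": ["binwalk", "exiftool"],       # pre-installed utilities relied on
}
```

At evaluation time, the harness mounts the listed artifacts into the task container, presents the query to the agent, and checks any submission against the golden flag.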

2. Formal Environment Specification

InterCode-CTF models each challenge as a partially observable Markov Decision Process (POMDP) (Yang et al., 2023):

  • State Space ($\mathcal{S}$): Full container filesystem and process state, plus a flag-discovery indicator.
  • Action Space ($\mathcal{A}$): Admissible shell or Python commands and the flag submission action; each action must be syntactically valid.
  • Observation Space ($\mathcal{O}$): The pair $(\texttt{stdout}, \Delta\texttt{fs})$, i.e., command output and a record of filesystem mutations.
  • Transition Function ($\mathcal{T}$): Deterministic application of agent commands in the containerized OS context.
  • Reward Function ($\mathcal{R}$): Sparse (+1 for correct flag submission), with optional negative rewards for invalid commands and shaped rewards for uncovering subflags.
  • Episode Termination: On correct flag submission or after exceeding the turn budget.

This formalization supports reinforcement learning (RL), imitation learning, as well as scripted and prompt-based agent strategies (Yang et al., 2023).
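
A minimal, Gym-style sketch of this loop is given below; the class, method, and helper names (CTFEnv, step, _execute_in_container) are assumptions for exposition, not the benchmark's actual API.

```python
# Minimal Gym-style sketch of the POMDP loop described above. CTFEnv, step,
# and _execute_in_container are illustrative names, not the real InterCode API.
class CTFEnv:
    def __init__(self, task, max_turns=40):
        self.task = task           # task record with "query" and "gold" fields
        self.max_turns = max_turns
        self.turn = 0

    def reset(self):
        self.turn = 0
        return self.task["query"]  # initial observation: the problem statement

    def step(self, action):
        """action is either a shell/Python command string or ('submit', flag)."""
        self.turn += 1
        if isinstance(action, tuple) and action[0] == "submit":
            solved = action[1] == self.task["gold"]
            # Sparse reward: +1 only for the correct flag; the episode ends on
            # success or when the turn budget is exhausted.
            done = solved or self.turn >= self.max_turns
            return "", (1.0 if solved else 0.0), done
        stdout, fs_delta = self._execute_in_container(action)  # hypothetical helper
        return (stdout, fs_delta), 0.0, self.turn >= self.max_turns

    def _execute_in_container(self, command):
        # In a real harness this would run `command` inside the task's Docker
        # container and diff the filesystem; omitted here.
        raise NotImplementedError
```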

3. Evaluation Protocols and Metrics

The primary metric is Pass@1 (success rate on the first attempt), where a challenge is considered solved if and only if the agent submits exactly the golden flag during its trajectory (Abramovich et al., 2024, Zhuo et al., 25 Aug 2025, Turtayev et al., 2024). Extensions include Pass@k for multiple attempts with environment resets.

Mathematically, for $N$ challenges:

$$\mathrm{Pass@1} = \frac{1}{N}\sum_{i=1}^{N} s_i$$

where $s_i = 1$ if the agent successfully submits the flag on challenge $i$, and $s_i = 0$ otherwise. For $k$ independent attempts, the per-task success probability is $P_i(k) = 1 - \left[1 - P_i(1)\right]^k$.
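
These definitions reduce to a few lines of code; the helpers below are an illustrative scoring sketch, not the benchmark's official evaluation script.

```python
# Illustrative helpers for Pass@1 and the per-task Pass@k formula above;
# not the benchmark's official evaluation code.
def pass_at_1(outcomes):
    """outcomes: one 0/1 entry per challenge, first attempt only."""
    return sum(outcomes) / len(outcomes)

def pass_at_k(per_task_p1, k):
    """Per-task probability of at least one success in k independent attempts."""
    return [1 - (1 - p) ** k for p in per_task_p1]

# Example: 100 challenges, 83 solved on the first attempt -> Pass@1 = 0.83.
print(pass_at_1([1] * 83 + [0] * 17))
# Example: per-attempt probabilities 0.5, 0.9, 0.0 over five attempts.
print(pass_at_k([0.5, 0.9, 0.0], k=5))  # approx. [0.96875, 0.99999, 0.0]
```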

Secondary metrics include:

  • Average Steps to Flag: Mean number of actions until flag submission.
  • Error Rate: Fraction of non-admissible (syntax-error) commands.
  • Category Breakdown: Per-category Pass@1, revealing agent strengths/weaknesses.
  • Granular Status Codes: Success, budget exhausted, context overflow, forfeit, error (Abramovich et al., 2024).

Constraints are imposed via strict generation budgets ($\leq\$3$ per instance in some studies), fixed turn limits, and tool usage restrictions. Reproducibility is ensured by deterministic Docker images and orchestrated evaluation scripts (Yang et al., 2023, Abramovich et al., 2024).

4. Baseline Agent Architectures and Comparative Results

Multiple agent paradigms have been evaluated on InterCode-CTF:

  • Zero-shot Prompting: Single-pass generation from a static prompt; poor performance (25–47% Pass@1) (Yang et al., 2023, Turtayev et al., 2024).
  • Iterative/Chain-of-Thought (ReAct): Alternating “thought” and “action” steps driven by stepwise environment feedback; substantial improvement, up to 83% Pass@1 (Turtayev et al., 2024). A minimal loop sketch follows this list.
  • Plan-and-Solve: High-level planning followed by sequential execution; intermediate improvement (65% Pass@1) (Turtayev et al., 2024).
  • ReAct→Plan: ReAct interleaved with periodic planning steps, further enhanced by invoking strong LLMs for replanning; peaks at 89% Pass@1 on a single attempt and 95% across five attempts (Turtayev et al., 2024).
  • EnIGMA: Custom interactive tools (debugger, connect) and LM-driven summarization modules; performance up to 72% Pass@1 on the full 100-task suite (Abramovich et al., 2024).
  • CTF-Dojo/Cyber-Zero: LLM-based agents fine-tuned on execution-grounded or synthetic trajectory datasets, reaching 83.5% (CTF-Dojo-32B) and 82.4% (Cyber-Zero-32B) (Zhuo et al., 25 Aug 2025, Zhuo et al., 29 Jul 2025).
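
The ReAct-style agents above share a simple interaction loop. The sketch below outlines it against the environment interface from Section 2; the llm() callable and parse_action() helper are hypothetical placeholders, not any published agent's implementation.

```python
# Bare ReAct-style loop over a CTF environment. The llm() callable and
# parse_action() helper are hypothetical placeholders, not a published agent.
def react_agent(env, llm, parse_action, max_turns=40):
    history = [str(env.reset())]  # start from the problem statement
    for _ in range(max_turns):
        # Ask the model for one interleaved "Thought: ... Action: ..." step.
        reply = llm("\n".join(history) +
                    "\nRespond with Thought: ..., then Action: <command> or submit <flag>.")
        action = parse_action(reply)  # e.g. "ls -la" or "submit picoCTF{...}"
        if action.startswith("submit "):
            obs, reward, done = env.step(("submit", action.split(" ", 1)[1]))
        else:
            obs, reward, done = env.step(action)
        history.extend([reply, str(obs)])  # feed the observation back to the model
        if done:
            return reward  # 1.0 if the golden flag was found, else 0.0
    return 0.0
```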

Notable aggregate scores:

| Agent | Pass@1 (%) | Attempt Spec. | Reference |
|---|---|---|---|
| InterCode (2023, zero-shot) | 25–40 | @1 | (Yang et al., 2023, Turtayev et al., 2024) |
| EnIGMA (GPT-4o) | 69 | @1 | (Abramovich et al., 2024) |
| EnIGMA (GPT-4 Turbo) | 72 | @1 | (Abramovich et al., 2024) |
| ReAct→Plan (GPT-4o, o1-preview) | 89 | @1 | (Turtayev et al., 2024) |
| ReAct→Plan | 95 | @5 | (Turtayev et al., 2024) |
| CTF-Dojo-32B | 83.5 | @1 | (Zhuo et al., 25 Aug 2025) |
| Cyber-Zero-32B | 82.4 | @1 | (Zhuo et al., 29 Jul 2025) |
| DeepSeek-V3-0324 (zero-shot) | 82.5 | @1 | (Zhuo et al., 25 Aug 2025) |

Category-wise performance reveals that strong agents (ReAct→Plan at five attempts) attain 100% on general skills and web exploitation, ~96% on reverse engineering, and >90% on cryptography and forensics, with only vision-based or internet-dependent tasks forming persistent failures (Turtayev et al., 2024).

5. Technical Insights and Observed Failure Modes

Empirical studies highlight several key determinants of performance:

  • Active Tool Use: Integrating category-specific binaries and debuggers is critical; omitting these tools drops solve rate by ~2.5 percentage points overall, with cryptography and binary exploitation most impacted (Abramovich et al., 2024).
  • Summarization: LM-driven output summarizers outperform naive or no summarization, preventing context overflow and increasing success (Abramovich et al., 2024).
  • Trajectory Length and Recovery: Long-horizon, multi-turn interactions (64.8% Pass@1) outperform single-turn demonstrations (25.3%), primarily by decreasing stuck-in-loop rates (11.1% vs. 73.5%) (Zhuo et al., 29 Jul 2025).
  • Multiple Independent Attempts: Allowing $k > 1$ attempts (with resets) enables near-saturation (95%) by correcting for action mis-ranking and exploration variance.
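
As a worked illustration of why resets help (the 0.5 per-attempt probability below is an assumed number, not a reported one): a task solved on any single attempt with probability $P_i(1) = 0.5$ is solved at least once in five independent attempts with probability

$$P_i(5) = 1 - (1 - 0.5)^5 = 1 - 0.03125 \approx 0.97,$$

so moderately reliable per-task behavior already pushes aggregate Pass@5 toward saturation, while tasks with near-zero per-attempt probability (e.g., vision-dependent ones) remain unsolved for any $k$.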

However, InterCode-CTF exposes several open limitations:

  • Data leakage: A nontrivial proportion of flags appear to be directly memorized by some foundation models (e.g., 14% of Claude 3.5 Sonnet runs), undermining benchmark validity (Abramovich et al., 2024, Turtayev et al., 2024).
  • Soliloquizing: Models sometimes fabricate non-existent observations in absence of environmental cues (Abramovich et al., 2024).
  • Vision and Networking Gaps: Tasks requiring image analysis or browser/HTTP APIs remain unsolved by most agents (Turtayev et al., 2024).
  • Population Memorization: Simple “blind submission” strategies achieved 10% solves, suggesting benchmark contamination or overfitting (Turtayev et al., 2024).

These findings motivate the creation of future, harder benchmarks with private or obfuscated challenge sets, integrated web/vision interfaces, and stricter data curation.

6. Impact and Research Significance

InterCode-CTF has become a central fixture for agent-based cybersecurity research, AI4Sec competitions, and LLM evaluations. Key impacts include:

  • Standardization: Provides a reproducible, extensible RL-style environment with support for new CTF challenges, tool augmentations, and reward shaping (Yang et al., 2023).
  • Innovation Driver: Enabled the development and validation of execution-grounded agent learning methodologies, fine-tuning strategies such as CTF-Dojo and Cyber-Zero, and detailed analysis of tool/plan chaining effects (Zhuo et al., 25 Aug 2025, Zhuo et al., 29 Jul 2025).
  • Curriculum and Sensitivity Studies: The structure inspired benchmarks like CTF-Code, which targets sensitivity to problem detail via counterfactual perturbations, and CTF-Instruct, which enhances LLM generalization and robustness (Luo et al., 20 May 2025).

Limitations include its “high school” challenge level—now saturated by plain LLM agents using modest prompting and tool selection—necessitating more sophisticated future benchmarks (e.g., NYU CTF Bench, HackTheBox) to track continued advances (Turtayev et al., 2024). Nevertheless, InterCode-CTF remains the reference suite for diagnostic, ablation, and transfer learning studies on interactive exploit discovery and agent robustness.

7. Extensibility, Best Practices, and Future Directions

The architecture of InterCode-CTF supports straightforward addition of new challenges via Docker image and dataset extension, reward/observation augmentation, and custom agent-computer interfaces (ACIs) (Yang et al., 2023). Recommendations for benchmark evolution, drawn from empirical studies, include the following; a minimal extension sketch appears after the list:

  • Broadening ACI coverage (browser, HTTP, database tools) for expanded challenge domains (Abramovich et al., 2024).
  • Strengthening privacy/obfuscation measures to prevent model contamination and leakage.
  • Structuring multi-stage dependencies and cross-challenge memory to test long-horizon reasoning and generalization.
  • Deploying benchmarking infrastructure to support continuous scoreboard updates and per-category analyses as new models and strategies emerge.
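
As a minimal sketch of the first extension path (new challenges via dataset and Docker image extension), assuming a JSON task manifest: the file name, fields, and image tag below are hypothetical, not the benchmark's real layout.

```python
# Hypothetical sketch of registering a new challenge: append a record to a
# JSON task manifest and point it at a dedicated Docker image.
import json

new_task = {
    "task_id": 101,
    "category": "Web Exploitation",
    "query": "Recover the flag served by the vulnerable endpoint on port 8080.",
    "artifacts": [],
    "gold": "picoCTF{replace_with_real_flag}",
    "docker_image": "intercode-ctf/web-101:latest",  # built from a custom Dockerfile
    "max_turns": 40,
}

with open("ctf_tasks.json", "r+") as f:
    tasks = json.load(f)
    tasks.append(new_task)
    f.seek(0)
    json.dump(tasks, f, indent=2)
    f.truncate()
```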

In sum, InterCode-CTF continues to influence both methodological advances in security-oriented LLMs and the broader design of reproducible, execution-driven agent benchmarks in applied machine learning and cybersecurity research (Yang et al., 2023, Abramovich et al., 2024, Turtayev et al., 2024, Zhuo et al., 25 Aug 2025).
