InterCode-CTF Benchmark
- InterCode-CTF is a container-based cybersecurity benchmark that standardizes interactive Capture-The-Flag challenges for evaluating automated agents.
- It models each challenge as a partially observable Markov Decision Process, defining clear state, action, and observation spaces with fixed turn limits.
- The benchmark drives advances in agent design by assessing strategies like zero-shot prompting and iterative planning via metrics such as Pass@1.
InterCode-CTF is a widely adopted benchmark designed to evaluate the capability of automated agents and LLMs to solve multi-step, interactive Capture-The-Flag (CTF) cybersecurity challenges within a reproducible, execution-driven environment. Derived from picoCTF—a high-school and undergraduate-level CTF competition—InterCode-CTF offers a standardized, container-based suite of tasks, enabling rigorous assessment of agent planning, tool use, vulnerability exploitation, and iterative problem solving. The benchmark has become a de facto yardstick for offensive security agents and is integral to numerous recent advances in agent-driven code intelligence and security research.
1. Benchmark Composition and Task Structure
InterCode-CTF consists of a curated set of CTF challenges sourced from the picoCTF archive, with the selection designed to balance category coverage and reproducibility. The canonical instantiations comprise between 85 and 100 tasks after filtering out instances that require vision or external internet access, or that ship with broken containers (Turtayev et al., 3 Dec 2024).
Category distribution across the main variants:
| Category | # Tasks (typical) | Description |
|---|---|---|
| Crypto | 16–19 | Cryptography, including RSA, group theory |
| Forensics | 13–15 | File carving, steganography, network analysis |
| Binary Exploitation | 2–4 | "Pwn" tasks—buffer overflows, stack exploits |
| Reverse Engineering | 27 | ELF analysis, static/dynamic inspection |
| Web Exploitation | 2 | Simple webserver flaws, network endpoints |
| Miscellaneous | 31–33 | Scripting, general skills, logic puzzles |
Each challenge is packaged as an isolated Docker container, including the following components (Yang et al., 2023, Abramovich et al., 24 Sep 2024, Zhuo et al., 25 Aug 2025):
- Problem Statement: Natural-language instructions describing the task and desired goal (typically “find and submit the flag”).
- Artifacts: Binaries, scripts, data files (images, PCAPs), or remote access endpoints.
- Execution Environment: Pre-installed Linux utilities (e.g., gdb, binwalk, tshark), scripting languages, and occasionally category-specific tools (e.g., RsaCtfTool for Crypto).
- Flag: A hidden “golden” string; static or dynamically generated, submission required for task completion.
- Interaction Scaffold: Agents interact via pre-defined commands (e.g., `ls`, `decompile`, `debug_start`, `connect_start`) through a controlled action interface (bash/Python shell), and submit the extracted flag for verification; a minimal interaction sketch appears below.
Tasks are capped at a fixed number of interaction turns (commonly 30–40), and category assignments follow the original picoCTF taxonomy (Turtayev et al., 3 Dec 2024, Yang et al., 2023).
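In practice, the scaffold can be viewed as a thin loop around a containerized shell plus a flag checker. The sketch below illustrates that interface; `CTFEnv`, its methods, and the default turn budget are hypothetical stand-ins, not the benchmark's actual API.

```python
# Minimal sketch of the interaction scaffold. CTFEnv, its methods, and the
# default turn budget are hypothetical stand-ins, not InterCode-CTF's real API.
import subprocess


class CTFEnv:
    """Wraps a single challenge container with a turn budget and a golden flag."""

    def __init__(self, container_id: str, golden_flag: str, max_turns: int = 40):
        self.container_id = container_id
        self.golden_flag = golden_flag
        self.max_turns = max_turns
        self.turns = 0
        self.solved = False

    def execute(self, command: str) -> str:
        """Run one bash command inside the challenge container and return its output."""
        self.turns += 1
        result = subprocess.run(
            ["docker", "exec", self.container_id, "bash", "-lc", command],
            capture_output=True, text=True, timeout=60,
        )
        return result.stdout + result.stderr

    def submit(self, flag: str) -> bool:
        """Check a candidate flag against the hidden golden string."""
        self.turns += 1
        self.solved = flag.strip() == self.golden_flag
        return self.solved

    @property
    def done(self) -> bool:
        """Episode ends on a correct submission or when the turn budget is spent."""
        return self.solved or self.turns >= self.max_turns
```

An agent then loops: read the problem statement, issue `execute(...)` calls to inspect artifacts, and finish with a `submit(...)` call.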
2. Formal Environment Specification
InterCode-CTF models each challenge as a partially observable Markov Decision Process (POMDP) (Yang et al., 2023):
- State Space ($\mathcal{S}$): Full container filesystem and process state, plus a flag-discovery indicator.
- Action Space ($\mathcal{A}$): Admissible shell or Python commands and the flag submission action; each action must be syntactically valid.
- Observation Space ($\mathcal{O}$): Pairs consisting of the command output and a record of filesystem mutations.
- Transition Function ($T$): Deterministic application of agent commands in the containerized OS context.
- Reward Function ($R$): Sparse ($+1$ for correct flag submission), with optional negative rewards for invalid commands and shaped rewards for uncovering subflags.
- Episode Termination: On correct flag submission or after exceeding the turn budget.
This formalization supports reinforcement learning (RL), imitation learning, as well as scripted and prompt-based agent strategies (Yang et al., 2023).
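The reward and termination components of this POMDP can be sketched directly; the helper names and the penalty/shaping constants below are illustrative assumptions, since the description above only fixes the sparse +1 signal.

```python
# Sketch of the reward and termination logic implied by the POMDP above. The
# penalty/shaping constants are illustrative assumptions; only the sparse +1
# for a correct flag submission is fixed by the benchmark description.
from dataclasses import dataclass, field


@dataclass
class EpisodeState:
    turns: int = 0
    solved: bool = False
    subflags_found: set = field(default_factory=set)


def reward(state: EpisodeState, action_valid: bool, correct_flag_submitted: bool,
           new_subflags: set, invalid_penalty: float = -0.1,
           subflag_bonus: float = 0.2) -> float:
    """Sparse +1 on success, with optional invalid-command and subflag shaping terms."""
    r = 1.0 if correct_flag_submitted else 0.0
    if not action_valid:
        r += invalid_penalty                                       # optional negative reward
    r += subflag_bonus * len(new_subflags - state.subflags_found)  # optional shaping
    return r


def terminated(state: EpisodeState, max_turns: int = 40) -> bool:
    """Terminate on a correct submission or once the turn budget is exhausted."""
    return state.solved or state.turns >= max_turns
```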
3. Evaluation Protocols and Metrics
The primary metric is Pass@1 (success rate on the first attempt), where a challenge is considered solved if and only if the agent submits exactly the golden flag during its trajectory (Abramovich et al., 24 Sep 2024, Zhuo et al., 25 Aug 2025, Turtayev et al., 3 Dec 2024). Extensions include Pass@k for multiple attempts with environment resets.
Mathematically, for $N$ challenges:
$$\text{Pass@1} = \frac{1}{N}\sum_{i=1}^{N} s_i,$$
where $s_i = 1$ if the flag is successfully submitted on challenge $i$ and $s_i = 0$ otherwise. For $k$ independent attempts with per-attempt success probability $p$, the per-task success probability is $1 - (1 - p)^k$.
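A direct translation of these formulas into code; this is a plain illustration of the metric definitions, not the benchmark's evaluation harness.

```python
# Direct implementation of the metric definitions above; illustration only.
from typing import Sequence


def pass_at_1(successes: Sequence[int]) -> float:
    """Pass@1 = (1/N) * sum_i s_i, with s_i in {0, 1} per challenge."""
    return sum(successes) / len(successes)


def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds: 1 - (1 - p)^k."""
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    outcomes = [1, 0, 1, 1, 0]      # e.g., 3 of 5 challenges solved on the first try
    print(pass_at_1(outcomes))      # 0.6
    print(pass_at_k(p=0.6, k=5))    # ~0.99
```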
Secondary metrics include:
- Average Steps to Flag: Mean number of actions until flag submission.
- Error Rate: Fraction of non-admissible (syntax-error) commands.
- Category Breakdown: Per-category Pass@1, revealing agent strengths/weaknesses.
- Granular Status Codes: Success, budget exhausted, context overflow, forfeit, error (Abramovich et al., 24 Sep 2024).
Constraints are imposed via strict generation budgets ($\leq\$3$ per instance in some studies), fixed turn limits, and tool usage restrictions. Reproducibility is ensured by deterministic Docker images and orchestrated evaluation scripts (Yang et al., 2023, Abramovich et al., 24 Sep 2024).
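These constraints are naturally expressed as a small, frozen evaluation config; the field names and defaults below are assumptions drawn from the limits reported above, not an official InterCode-CTF schema.

```python
# Illustrative evaluation constraints bundled into one config; field names and
# defaults are assumptions based on the limits reported in the text.
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalConfig:
    max_turns: int = 40             # fixed interaction budget per task
    max_cost_usd: float = 3.0       # generation budget per instance (some studies)
    allow_internet: bool = False    # internet-dependent tasks are filtered out
    attempts: int = 1               # k for Pass@k, with environment resets between attempts
    docker_image_digest: str = ""   # pinned image digest for reproducibility
```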
4. Baseline Agent Architectures and Comparative Results
Multiple agent paradigms have been evaluated on InterCode-CTF:
- Zero-shot Prompting: Single-pass, static prompt generation; poor performance (25–47% Pass@1) (Yang et al., 2023, Turtayev et al., 3 Dec 2024).
- Iterative/Chain-of-Thought (ReAct): Alternating “thought” and “action” steps, using agent-environment stepwise feedback. Significant improvement up to 83% Pass@1 (Turtayev et al., 3 Dec 2024).
- Plan-and-Solve: High-level planning followed by sequential execution; intermediate improvement (65% Pass@1) (Turtayev et al., 3 Dec 2024).
- ReAct→Plan: Interleaves ReAct with an explicit planning step, further enhanced by delegating replanning to strong LLMs; peaks at 89% Pass@1 on a single attempt and 95% across five attempts (Turtayev et al., 3 Dec 2024). A hedged loop sketch follows this list.
- EnIGMA: Custom interactive tools (debugger, connect) and LM-driven summarization modules; performance up to 72% Pass@1 on the full 100-task suite (Abramovich et al., 24 Sep 2024).
- CTF-Dojo/Cyber-Zero: LLM-based agents fine-tuned on execution-grounded or synthetic trajectory datasets, reaching 83.5% (CTF-Dojo-32B) and 82.4% (Cyber-Zero-32B) (Zhuo et al., 25 Aug 2025, Zhuo et al., 29 Jul 2025).
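As a rough illustration of the ReAct→Plan pattern, the sketch below alternates thought/action steps with periodic replanning; `llm` stands in for any text-completion callable, the prompts are illustrative, and `env` reuses the hypothetical `CTFEnv` interface (plus a `problem_statement` attribute) from the sketch in Section 1. None of this is the published agents' code.

```python
# Hedged sketch of a ReAct-style loop with periodic replanning. `llm` is any
# text-completion callable; `env` reuses the hypothetical CTFEnv interface
# (plus a problem_statement attribute) from the earlier sketch.
def react_plan_agent(env, llm, replan_every: int = 5) -> bool:
    history = []
    plan = llm("Outline a step-by-step plan to capture the flag for this task:\n"
               + env.problem_statement)
    while not env.done:
        if history and len(history) % replan_every == 0:
            plan = llm(f"Progress so far:\n{history}\nRevise the plan:\n{plan}")
        response = llm(
            f"Plan:\n{plan}\nHistory:\n{history}\n"
            "Think step by step, then output exactly one bash command or `submit <flag>`."
        )
        lines = response.strip().splitlines()
        action = lines[-1].strip() if lines else "ls"     # fall back to a harmless command
        if action.startswith("submit "):
            if env.submit(action[len("submit "):]):
                return True                               # flag verified
        else:
            observation = env.execute(action)
            history.append((action, observation[:2000]))  # truncate long outputs
    return env.solved
```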
Notable aggregate scores:
| Agent | Pass@1 (%) | Attempt Spec. | Reference |
|---|---|---|---|
| InterCode (2023, zero-shot) | 25–40 | @1 | (Yang et al., 2023, Turtayev et al., 3 Dec 2024) |
| EnIGMA (GPT-4o) | 69 | @1 | (Abramovich et al., 24 Sep 2024) |
| EnIGMA (GPT-4 Turbo) | 72 | @1 | (Abramovich et al., 24 Sep 2024) |
| ReActPlan (GPT-4o, o1-preview) | 89 | @1 | (Turtayev et al., 3 Dec 2024) |
| ReActPlan | 95 | @5 | (Turtayev et al., 3 Dec 2024) |
| CTF-Dojo-32B | 83.5 | @1 | (Zhuo et al., 25 Aug 2025) |
| Cyber-Zero-32B | 82.4 | @1 | (Zhuo et al., 29 Jul 2025) |
| DeepSeek-V3-0324 (zero-shot) | 82.5 | @1 | (Zhuo et al., 25 Aug 2025) |
Category-wise performance reveals that strong agents (ReAct→Plan with five attempts) attain 100% on general skills and web exploitation, ~96% on reverse engineering, and >90% on cryptography and forensics, with only vision-based or internet-dependent tasks forming persistent failures (Turtayev et al., 3 Dec 2024).
5. Technical Insights and Observed Failure Modes
Empirical studies highlight several key determinants of performance:
- Active Tool Use: Integrating category-specific binaries and debuggers is critical; omitting these tools drops solve rate by ~2.5 percentage points overall, with cryptography and binary exploitation most impacted (Abramovich et al., 24 Sep 2024).
- Summarization: LM-driven output summarizers outperform naive or no summarization, preventing context overflow and increasing success (Abramovich et al., 24 Sep 2024).
- Trajectory Length and Recovery: Long-horizon, multi-turn interactions (64.8% Pass@1) outperform single-turn demonstrations (25.3%), primarily by decreasing stuck-in-loop rates (11.1% vs. 73.5%) (Zhuo et al., 29 Jul 2025).
- Multiple Independent Attempts: Allowing several independent attempts (with environment resets) enables near-saturation (95%) by correcting for action mis-ranking and exploration variance (Turtayev et al., 3 Dec 2024).
However, InterCode-CTF exposes several open limitations:
- Data leakage: A nontrivial proportion of flags appear to be directly memorized by some foundation models (e.g., 14% of Claude 3.5 Sonnet runs), undermining benchmark validity (Abramovich et al., 24 Sep 2024, Turtayev et al., 3 Dec 2024).
- Soliloquizing: Models sometimes fabricate non-existent observations in absence of environmental cues (Abramovich et al., 24 Sep 2024).
- Vision and Networking Gaps: Tasks requiring image analysis or browser/HTTP APIs remain unsolved by most agents (Turtayev et al., 3 Dec 2024).
- Population Memorization: Simple “blind submission” strategies achieved 10% solves, suggesting benchmark contamination or overfitting (Turtayev et al., 3 Dec 2024).
These findings motivate the creation of future, harder benchmarks with private or obfuscated challenge sets, integrated web/vision interfaces, and stricter data curation.
6. Impact and Research Significance
InterCode-CTF has become a central fixture for agent-based cybersecurity research, AI4Sec competitions, and LLM evaluations. Key impacts include:
- Standardization: Provides a reproducible, extensible RL-style environment with support for new CTF challenges, tool augmentations, and reward shaping (Yang et al., 2023).
- Innovation Driver: Enabled the development and validation of execution-grounded agent learning methodologies, fine-tuning strategies such as CTF-Dojo and Cyber-Zero, and detailed analysis of tool/plan chaining effects (Zhuo et al., 25 Aug 2025, Zhuo et al., 29 Jul 2025).
- Curriculum and Sensitivity Studies: The structure inspired benchmarks like CTF-Code, which targets sensitivity to problem detail via counterfactual perturbations, and CTF-Instruct, which enhances LLM generalization and robustness (Luo et al., 20 May 2025).
Limitations include its “high school” challenge level—now saturated by plain LLM agents using modest prompting and tool selection—necessitating more sophisticated future benchmarks (e.g., NYU CTF Bench, HackTheBox) to track continued advances (Turtayev et al., 3 Dec 2024). Nevertheless, InterCode-CTF remains the reference suite for diagnostic, ablation, and transfer learning studies on interactive exploit discovery and agent robustness.
7. Extensibility, Best Practices, and Future Directions
The architecture of InterCode-CTF supports easy addition of new challenges via Docker image and dataset extension, reward/observation augmentation, and custom agent-computer interfaces (Yang et al., 2023); a hedged registration sketch follows the recommendations below. Recommendations for benchmark evolution, drawn from empirical studies, include:
- Broadening ACI coverage (browser, HTTP, database tools) for expanded challenge domains (Abramovich et al., 24 Sep 2024).
- Strengthening privacy/obfuscation measures to prevent model contamination and leakage.
- Structuring multi-stage dependencies and cross-challenge memory to test long-horizon reasoning and generalization.
- Deploying benchmarking infrastructure to support continuous scoreboard updates and per-category analyses as new models and strategies emerge.
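A minimal sketch of what dataset extension could look like, assuming a JSON-lines task registry; the record fields are a guessed, plausible schema rather than InterCode-CTF's actual dataset format.

```python
# Sketch of registering a new challenge in a JSON-lines task registry. The
# record fields are a guessed, plausible schema, not InterCode-CTF's actual format.
import json
from pathlib import Path


def register_challenge(dataset_path: str, task_id: str, category: str,
                       prompt: str, image_tag: str, gold_flag: str,
                       max_turns: int = 40) -> None:
    """Append one task record to the dataset backing the benchmark."""
    record = {
        "task_id": task_id,
        "category": category,      # follows the picoCTF taxonomy
        "query": prompt,           # natural-language problem statement
        "image": image_tag,        # pre-built Docker image with artifacts and tools
        "gold": gold_flag,         # golden flag checked on submission
        "max_turns": max_turns,
    }
    with Path(dataset_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


# Illustrative usage:
# register_challenge("ctf_tasks.jsonl", "custom-001", "Forensics",
#                    "Recover the flag hidden in capture.pcap.",
#                    "myorg/ctf-custom-001:latest", "picoCTF{...}")
```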
In sum, InterCode-CTF continues to influence both methodological advances in security-oriented LLMs and the broader design of reproducible, execution-driven agent benchmarks in applied machine learning and cybersecurity research (Yang et al., 2023, Abramovich et al., 24 Sep 2024, Turtayev et al., 3 Dec 2024, Zhuo et al., 25 Aug 2025).