InterCode-CTF Benchmark

Updated 19 November 2025
  • InterCode-CTF is a container-based cybersecurity benchmark that standardizes interactive Capture-The-Flag challenges for evaluating automated agents.
  • It models each challenge as a partially observable Markov Decision Process, defining clear state, action, and observation spaces with fixed turn limits.
  • The benchmark drives advances in agent design by assessing strategies like zero-shot prompting and iterative planning via metrics such as Pass@1.

InterCode-CTF is a widely adopted benchmark designed to evaluate the capability of automated agents and LLMs to solve multi-step, interactive Capture-The-Flag (CTF) cybersecurity challenges within a reproducible, execution-driven environment. Derived from picoCTF—a high-school and undergraduate-level CTF competition—InterCode-CTF offers a standardized, container-based suite of tasks, enabling rigorous assessment of agent planning, tool use, vulnerability exploitation, and iterative problem solving. The benchmark has become a de facto yardstick for offensive security agents and is integral to numerous recent advances in agent-driven code intelligence and security research.

1. Benchmark Composition and Task Structure

InterCode-CTF consists of a curated set of CTF challenges sourced from the picoCTF archive, with the selection designed to balance category coverage and reproducibility. The canonical instantiations comprise between 85 and 100 tasks, after filtering out instances requiring vision, external internet access, or with broken containers (Turtayev et al., 3 Dec 2024).

Category distribution across the main variants:

| Category | # Tasks (typical) | Description |
|---|---|---|
| Crypto | 16–19 | Cryptography, including RSA, group theory |
| Forensics | 13–15 | File carving, steganography, network analysis |
| Binary Exploitation | 2–4 | "Pwn" tasks: buffer overflows, stack exploits |
| Reverse Engineering | 27 | ELF analysis, static/dynamic inspection |
| Web Exploitation | 2 | Simple webserver flaws, network endpoints |
| Miscellaneous | 31–33 | Scripting, general skills, logic puzzles |

Each challenge is packaged as an isolated Docker container, including the following components (Yang et al., 2023, Abramovich et al., 24 Sep 2024, Zhuo et al., 25 Aug 2025):

  • Problem Statement: Natural-language instructions describing the task and desired goal (typically “find and submit the flag”).
  • Artifacts: Binaries, scripts, data files (images, PCAPs), or remote access endpoints.
  • Execution Environment: Pre-installed Linux utilities (e.g., gdb, binwalk, tshark), scripting languages, and occasionally category-specific tools (e.g., RsaCtfTool for Crypto).
  • Flag: A hidden “golden” string; static or dynamically generated, submission required for task completion.
  • Interaction Scaffold: Agents interact via pre-defined commands (e.g., ls, decompile, debug_start, connect_start), a controlled action interface (bash/Python shell), and can submit the extracted flag for verification.

Tasks are capped at a fixed number of interaction turns (commonly 30–40), and category assignments follow the original picoCTF taxonomy (Turtayev et al., 3 Dec 2024, Yang et al., 2023).
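
To make the packaging concrete, the following is a hypothetical task record reflecting the components above; the field names and values are illustrative assumptions, not the benchmark's actual schema:

```python
# Hypothetical task record mirroring the components listed above.
# Field names are illustrative and do not reproduce InterCode-CTF's real schema.
task = {
    "task_id": "picoctf-example-001",
    "category": "Forensics",
    "query": "Find and submit the flag hidden in the provided capture file.",
    "artifacts": ["capture.pcap"],                      # files mounted into the container
    "docker_image": "intercode-ctf/forensics:latest",   # isolated execution environment
    "gold_flag": "picoCTF{...}",                        # hidden golden string checked at submission
    "max_turns": 40,                                    # fixed interaction budget
    "allowed_actions": ["bash", "python", "submit"],    # controlled action interface
}
```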

2. Formal Environment Specification

InterCode-CTF models each challenge as a partially observable Markov Decision Process (POMDP) (Yang et al., 2023):

  • State Space ($\mathcal{S}$): Full container filesystem and process state, plus a flag-discovery indicator.
  • Action Space ($\mathcal{A}$): Admissible shell or Python commands and the flag-submission action; each action must be syntactically valid.
  • Observation Space ($\mathcal{O}$): The pair $(\texttt{stdout}, \Delta\texttt{fs})$, i.e., command output and a record of filesystem mutations.
  • Transition Function ($\mathcal{T}$): Deterministic application of agent commands in the containerized OS context.
  • Reward Function ($\mathcal{R}$): Sparse (+1 for correct flag submission), with optional negative rewards for invalid commands and shaped rewards for uncovering subflags.
  • Episode Termination: On correct flag submission or after exceeding the turn budget.

This formalization supports reinforcement learning (RL), imitation learning, as well as scripted and prompt-based agent strategies (Yang et al., 2023).
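
A minimal gym-style sketch of this POMDP interface is shown below; class and method names are illustrative assumptions and do not reproduce the actual InterCode API:

```python
import subprocess

class CTFEnvSketch:
    """Illustrative POMDP wrapper over a single containerized CTF task.

    Follows the formalization above: actions are shell commands or a flag
    submission; observations pair stdout with a filesystem diff; the reward
    is sparse (+1 only on correct flag submission); episodes end on correct
    submission or when the turn budget is exhausted.
    """

    def __init__(self, gold_flag: str, max_turns: int = 40):
        self.gold_flag = gold_flag
        self.max_turns = max_turns
        self.turn = 0

    def step(self, action: str):
        self.turn += 1
        done = self.turn >= self.max_turns

        # Flag submission action: terminal, sparse reward.
        if action.startswith("submit "):
            flag = action[len("submit "):].strip()
            reward = 1.0 if flag == self.gold_flag else 0.0
            return {"stdout": "", "fs_diff": []}, reward, True, {}

        # Execute the command; a real harness would run this inside the task's
        # Docker container and compute the filesystem diff, omitted in this sketch.
        proc = subprocess.run(action, shell=True, capture_output=True,
                              text=True, timeout=60)
        obs = {"stdout": proc.stdout + proc.stderr, "fs_diff": []}
        return obs, 0.0, done, {}
```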

3. Evaluation Protocols and Metrics

The primary metric is Pass@1 (success rate on the first attempt), where a challenge is considered solved if and only if the agent submits exactly the golden flag during its trajectory (Abramovich et al., 24 Sep 2024, Zhuo et al., 25 Aug 2025, Turtayev et al., 3 Dec 2024). Extensions include Pass@k for multiple attempts with environment resets.

Mathematically, for $N$ challenges:

$$\mathrm{Pass@1} = \frac{1}{N}\sum_{i=1}^{N} s_i$$

where $s_i = 1$ if the flag is successfully submitted on challenge $i$, and $0$ otherwise. For $k$ independent attempts, the per-task success probability is $P_i(k) = 1 - \left[1 - P_i(1)\right]^k$.
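
Both metrics are simple to compute from per-task outcomes; a minimal sketch, with function names chosen for illustration:

```python
def pass_at_1(successes: list[bool]) -> float:
    """Fraction of challenges solved on the first attempt."""
    return sum(successes) / len(successes)

def pass_at_k(per_task_p1: list[float], k: int) -> float:
    """Expected Pass@k under k independent attempts with environment resets,
    using P_i(k) = 1 - (1 - P_i(1))^k for each task i."""
    return sum(1.0 - (1.0 - p) ** k for p in per_task_p1) / len(per_task_p1)

# Example: 100 tasks, 69 solved on the first attempt.
print(pass_at_1([True] * 69 + [False] * 31))  # 0.69
```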

Secondary metrics include:

  • Average Steps to Flag: Mean number of actions until flag submission.
  • Error Rate: Fraction of non-admissible (syntax-error) commands.
  • Category Breakdown: Per-category Pass@1, revealing agent strengths/weaknesses.
  • Granular Status Codes: Success, budget exhausted, context overflow, forfeit, error (Abramovich et al., 24 Sep 2024).
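
A sketch of how these secondary metrics could be aggregated from episode logs, assuming hypothetical record fields rather than the benchmark's actual logging format:

```python
from collections import defaultdict

def aggregate(episodes: list[dict]) -> dict:
    """Aggregate secondary metrics from hypothetical episode records with keys:
    'category', 'solved', 'steps', 'invalid_actions', 'total_actions', 'status'."""
    solved = [e for e in episodes if e["solved"]]
    per_category = defaultdict(list)
    for e in episodes:
        per_category[e["category"]].append(e["solved"])

    return {
        "avg_steps_to_flag": sum(e["steps"] for e in solved) / max(len(solved), 1),
        "error_rate": sum(e["invalid_actions"] for e in episodes)
                      / max(sum(e["total_actions"] for e in episodes), 1),
        "category_pass_at_1": {c: sum(v) / len(v) for c, v in per_category.items()},
        "status_counts": {s: sum(e["status"] == s for e in episodes)
                          for s in {e["status"] for e in episodes}},
    }
```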

Constraints are imposed via strict generation budgets ($\leq\$3$ per instance in some studies), fixed turn limits, and tool usage restrictions. Reproducibility is ensured by deterministic Docker images and orchestrated evaluation scripts (Yang et al., 2023, Abramovich et al., 24 Sep 2024).

4. Baseline Agent Architectures and Comparative Results

Multiple agent paradigms have been evaluated on InterCode-CTF, ranging from zero-shot prompting to tool-augmented iterative planning and fine-tuned open-weight agents. Notable aggregate scores:

| Agent | Solve Rate (%) | Attempt Spec. | Reference |
|---|---|---|---|
| InterCode (2023, zero-shot) | 25–40 | @1 | (Yang et al., 2023, Turtayev et al., 3 Dec 2024) |
| EnIGMA (GPT-4o) | 69 | @1 | (Abramovich et al., 24 Sep 2024) |
| EnIGMA (GPT-4 Turbo) | 72 | @1 | (Abramovich et al., 24 Sep 2024) |
| ReActPlan (GPT-4o, o1-preview) | 89 | @1 | (Turtayev et al., 3 Dec 2024) |
| ReActPlan | 95 | @5 | (Turtayev et al., 3 Dec 2024) |
| CTF-Dojo-32B | 83.5 | @1 | (Zhuo et al., 25 Aug 2025) |
| Cyber-Zero-32B | 82.4 | @1 | (Zhuo et al., 29 Jul 2025) |
| DeepSeek-V3-0324 (zero-shot) | 82.5 | @1 | (Zhuo et al., 25 Aug 2025) |

Category-wise performance reveals that strong agents (ReActPlan@5) attain 100% on general skills and web exploitation, roughly 96% on reverse engineering, and over 90% on cryptography and forensics, with only vision-based or internet-dependent tasks forming persistent failures (Turtayev et al., 3 Dec 2024).

5. Technical Insights and Observed Failure Modes

Empirical studies highlight several key determinants of performance:

  • Active Tool Use: Integrating category-specific binaries and debuggers is critical; omitting these tools drops solve rate by ~2.5 percentage points overall, with cryptography and binary exploitation most impacted (Abramovich et al., 24 Sep 2024).
  • Summarization: LM-driven output summarizers outperform naive or no summarization, preventing context overflow and increasing success (Abramovich et al., 24 Sep 2024).
  • Trajectory Length and Recovery: Long-horizon, multi-turn interactions (64.8% Pass@1) outperform single-turn demonstrations (25.3%), primarily by decreasing stuck-in-loop rates (11.1% vs. 73.5%) (Zhuo et al., 29 Jul 2025).
  • Multiple Independent Attempts: Allowing k>1k>1 attempts (with resets) enables near-saturation (95%) by correcting for action mis-ranking and exploration variance.
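
As an illustrative calculation (the per-task rate here is hypothetical, not drawn from the cited results): a task with single-attempt success probability $P_i(1) = 0.45$ is solved within five independent attempts with probability

$P_i(5) = 1 - (1 - 0.45)^5 \approx 0.95,$

showing how even moderately difficult tasks saturate quickly once retries with environment resets are permitted.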

However, InterCode-CTF also exposes several open limitations: its picoCTF-derived challenges are publicly available and thus susceptible to training-data contamination; vision-dependent and internet-dependent tasks must be excluded outright; and top agents now saturate much of the suite, leaving little headroom for distinguishing methods. These findings motivate the creation of future, harder benchmarks with private or obfuscated challenge sets, integrated web/vision interfaces, and stricter data curation.

6. Impact and Research Significance

InterCode-CTF has become a central fixture for agent-based cybersecurity research, AI4Sec competitions, and LLM evaluations. Key impacts include:

  • Standardization: Provides a reproducible, extensible RL-style environment with support for new CTF challenges, tool augmentations, and reward shaping (Yang et al., 2023).
  • Innovation Driver: Enabled the development and validation of execution-grounded agent learning methodologies, fine-tuning strategies such as CTF-Dojo and Cyber-Zero, and detailed analysis of tool/plan chaining effects (Zhuo et al., 25 Aug 2025, Zhuo et al., 29 Jul 2025).
  • Curriculum and Sensitivity Studies: The structure inspired benchmarks like CTF-Code, which targets sensitivity to problem detail via counterfactual perturbations, and CTF-Instruct, which enhances LLM generalization and robustness (Luo et al., 20 May 2025).

Limitations include its “high school” challenge level—now saturated by plain LLM agents using modest prompting and tool selection—necessitating more sophisticated future benchmarks (e.g., NYU CTF Bench, HackTheBox) to track continued advances (Turtayev et al., 3 Dec 2024). Nevertheless, InterCode-CTF remains the reference suite for diagnostic, ablation, and transfer learning studies on interactive exploit discovery and agent robustness.

7. Extensibility, Best Practices, and Future Directions

The architecture of InterCode-CTF supports straightforward addition of new challenges via Docker image and dataset extension, reward/observation augmentation, and custom agent-computer interfaces (Yang et al., 2023); a minimal extension sketch follows the list below. Recommendations for benchmark evolution, drawn from empirical studies, include:

  • Broadening agent-computer interface (ACI) coverage (browser, HTTP, database tools) to support expanded challenge domains (Abramovich et al., 24 Sep 2024).
  • Strengthening privacy/obfuscation measures to prevent model contamination and leakage.
  • Structuring multi-stage dependencies and cross-challenge memory to test long-horizon reasoning and generalization.
  • Deploying benchmarking infrastructure to support continuous scoreboard updates and per-category analyses as new models and strategies emerge.
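
As a concrete illustration of the extension path described above, the following sketch adds a hypothetical new challenge; the paths, image names, and dataset schema are assumptions for illustration only, not the repository's actual layout:

```python
import json
import pathlib

# Hypothetical layout for registering a new challenge; paths and fields are illustrative.
task_dir = pathlib.Path("tasks/my-new-crypto-task")
task_dir.mkdir(parents=True, exist_ok=True)

# 1. Package the challenge environment as a Docker image with the needed tooling.
(task_dir / "Dockerfile").write_text(
    "FROM ubuntu:22.04\n"
    "RUN apt-get update && apt-get install -y python3 gdb binwalk\n"
    "COPY challenge/ /ctf/\n"
)

# 2. Register the task in an assumed JSONL dataset file consumed by the harness.
record = {
    "task_id": "custom-crypto-001",
    "category": "Crypto",
    "query": "Recover the flag from the provided RSA parameters.",
    "docker_image": "intercode-ctf/custom-crypto-001",
    "gold_flag": "flag{...}",
    "max_turns": 40,
}
with open("tasks/dataset.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```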

In sum, InterCode-CTF continues to influence both methodological advances in security-oriented LLMs and the broader design of reproducible, execution-driven agent benchmarks in applied machine learning and cybersecurity research (Yang et al., 2023, Abramovich et al., 24 Sep 2024, Turtayev et al., 3 Dec 2024, Zhuo et al., 25 Aug 2025).
