Jeopardy-Style CTF Competitions

Updated 3 December 2025
  • Jeopardy-style CTFs are cybersecurity competitions featuring category-labeled challenges solved by flag extraction to assess human and AI exploit capabilities.
  • They employ structured scoring models with fixed points, time-decay penalties, and automated validation via platforms like CTF-Dojo and CTFTiny.
  • Despite advanced AI performance, these competitions continue to provide practical training, systematic benchmarking, and educational insights for cybersecurity.

Jeopardy-style Capture The Flag (CTF) competitions are a foundational format in the cybersecurity ecosystem, serving as both training grounds and benchmarks for human and AI agents’ exploit capabilities. Defined by a board of independent, category-labeled challenges—each solved by extracting a "flag" string—Jeopardy CTFs have become a principal vehicle for evaluating both practical security skill and automated system performance. Their design, execution, and now partial obsolescence by advanced AI agents provide a comprehensive case study in competitive security benchmarking.

1. Structural Principles and Benchmark Taxonomy

Jeopardy-style CTFs present participants with a menu of standalone tasks, organized by technical domain and assigned point values scaled to estimated difficulty. Core categories include cryptography, binary exploitation ("pwn"), reverse engineering, web exploitation, forensics, and miscellaneous logic or steganography puzzles. Each challenge's solution is a flag—typically matching the pattern FLAG{.*}—which, when submitted to a central scoring server, credits the team with points. Competition proceeds asynchronously: teams select challenges in any order from the static board, enabling parallelized problem-solving and specialized strategy (Shao et al., 5 Aug 2025).
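
As a small illustration of this convention, a grader or solver script can harvest candidate flags from raw tool output with a regular expression (a minimal sketch assuming the FLAG{.*} pattern named above):

```python
import re

# Assumed flag convention from the challenge board: FLAG{...}
FLAG_PATTERN = re.compile(r"FLAG\{[^}]+\}")

def extract_flags(output: str) -> list[str]:
    """Return every candidate flag found in raw tool or exploit output."""
    return FLAG_PATTERN.findall(output)

# Example: scanning the stdout of an exploit script
stdout = "leaked: 0x41414141 ... FLAG{regex_harvest_demo}\n"
print(extract_flags(stdout))  # ['FLAG{regex_harvest_demo}']
```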

The most prominent recent benchmarks and frameworks include:

  • CTF-Dojo: 658 reproducible, Dockerized Jeopardy challenges spanning six broad categories with automated validation and metadata encapsulation (Zhuo et al., 25 Aug 2025).
  • CTFTiny: A distilled set of 50 CTF challenges classified empirically for difficulty balance; underpins systematic LLM agent evaluation (Shao et al., 5 Aug 2025).
  • CAIBench: A meta-benchmark integrating 117 Jeopardy CTFs from multiple subdomains (Base, Cybench, Robotics, AutoPenBench) and employing a five-star system to grade difficulty (Sanz-Gómez et al., 28 Oct 2025).

Category composition for representative systems:

Platform  | Crypto | Binary Exploit | Rev Eng | Forensics | Web | Misc | Robotics
CTF-Dojo  | 228    | 163            | 123     | 38        | 21  | 85   | —
CTFTiny   | —      | —              | —       | —         | —   | —    | —
CAIBench  | —      | —              | —       | —         | —   | —    | ✓ (RCTF2)

(— indicates no per-category count reported in the source.)

This taxonomy enables structured benchmarking, targeted agent training, and fair scoring protocols.

2. Scoring Models and Evaluation Metrics

Scoring in Jeopardy CTFs historically employs fixed point values per challenge—sometimes augmented with time-decay or hint-based penalties. For example, CTFd-based university course deployments assigned each challenge a static value; use of hints at zero or nonzero cost adjusted the final tally (Vykopal et al., 2020). More advanced models incentivize complex objectives, such as stealth in IDS-specific Jeopardy events, by applying a convex logarithmic decay function $f(x) = \max\left(b,\ a - s \ln(x)\,(a-b)\right)$, where $x$ is cumulative alert severity, $a$ and $b$ are bounds, and $s$ tunes decay (Kern et al., 20 Jan 2025).
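
A direct transcription of this decay function into code (variable names mirror the formula; how alert severity is aggregated into $x$ is left abstract) might look as follows:

```python
import math

def stealth_score(x: float, a: float, b: float, s: float) -> float:
    """Score for a solve given cumulative IDS alert severity x.

    a: maximum score (fully stealthy solve), b: score floor, s: decay rate.
    Sketch of f(x) = max(b, a - s*ln(x)*(a - b)); the x <= 1 guard is an
    assumed convention for alert-free or near-alert-free solves.
    """
    if x <= 1:
        return float(a)
    return max(b, a - s * math.log(x) * (a - b))

# Example: a 500-point challenge with a 100-point floor and s = 0.3
for severity in (1, 5, 20, 100):
    print(severity, round(stealth_score(severity, a=500, b=100, s=0.3)))
```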

AI agent competitions frequently adopt the Pass@k metric, $\text{Pass@}k = 1 - \frac{\binom{N - c}{k}}{\binom{N}{k}}$, where $N$ is the sample count and $c$ is the number of correct flags; Pass@1 ($c/N$) is standard for head-to-head performance (Zhuo et al., 25 Aug 2025).
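
The estimator is straightforward to compute directly; a minimal implementation (not tied to any particular framework's evaluation harness) is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Pass@k: probability that at least one of k samples drawn from n
    recorded attempts (c of which recovered the correct flag) succeeds."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # 0.3 -- Pass@1 is just c/n
print(pass_at_k(n=10, c=3, k=5))  # chance that at least one of 5 draws solves it
```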

Multi-dimensional scoring is advocated for richer evaluation, capturing procedural correctness across dimensions like reconnaissance, methodology, accuracy, and adaptability—exemplified by the CTF Competency Index (CCI), where

$\text{CCI}(T,G) = \sum_i w_i F_i(T,G), \qquad \sum_i w_i = 1,$

with partial credit reflecting granular alignment to expert write-ups (Shao et al., 5 Aug 2025).
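
A sketch of this weighted aggregation (the dimension names and weights below are illustrative placeholders, not the rubric used in the paper):

```python
def cci(dimension_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted CTF Competency Index over per-dimension partial-credit scores
    in [0, 1]; the weights are assumed to sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[d] * dimension_scores[d] for d in weights)

# Illustrative dimensions only -- the published CCI defines its own rubric
scores  = {"recon": 0.8, "methodology": 0.6, "accuracy": 1.0, "adaptability": 0.4}
weights = {"recon": 0.2, "methodology": 0.3, "accuracy": 0.3, "adaptability": 0.2}
print(cci(scores, weights))  # 0.72
```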

Empirically, state-of-the-art AI agents exhibit saturation on sub-benchmarks of easy/medium difficulty (67–75% solves), but performance on harder/novel tasks remains variable (10–46%) and domain-dependent (Sanz-Gómez et al., 28 Oct 2025).

3. Infrastructure and Automation: From Manual Setup to Execution-Grounded Pipelines

Challenge environment reproducibility and automation are now standard. CTF-Dojo's architecture employs isolated Docker containers per challenge, with standardized run scripts, port exposure, and flag permissions (mode 444 in /flag). Automated build, run, and validation processes attain >98% robustness over hundreds of instances (Zhuo et al., 25 Aug 2025).
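
A minimal validation loop in this spirit (this is not CTF-Dojo's actual harness; the image name, challenge directory layout, and /run_solution.sh entry point are placeholder assumptions) could shell out to the Docker CLI and check the recovered flag:

```python
import subprocess

def validate_challenge(image: str, expected_flag: str, timeout: int = 120) -> bool:
    """Build and run a Dockerized challenge's reference solver, then check
    that the flag it prints matches the expected value (placeholder flow)."""
    subprocess.run(["docker", "build", "-t", image, f"./challenges/{image}"], check=True)
    result = subprocess.run(
        ["docker", "run", "--rm", image, "/run_solution.sh"],  # assumed entry point
        capture_output=True, text=True, timeout=timeout,
    )
    return expected_flag in result.stdout

# Hypothetical usage against one challenge directory:
# print(validate_challenge("pwn-heap-01", "FLAG{example}"))
```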

Automated environment pipelines, such as CTF-Forge, leverage LLMs (e.g., DeepSeek-V3-0324) to ingest challenge artifacts and construct Dockerfiles, docker-compose configurations, and metadata JSONs, enforcing environment uniformity while eliminating weeks of manual engineering. This enables scalable challenge ingestion, supports reproducibility, and guarantees rapid rollback or augmentation as new challenge sets emerge (Zhuo et al., 25 Aug 2025).
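
The per-challenge metadata emitted by such a pipeline might resemble the record below (all field names are illustrative assumptions rather than the actual CTF-Forge schema):

```python
import json

# Illustrative metadata record for one auto-packaged challenge
challenge_meta = {
    "name": "rusty-rsa",
    "category": "crypto",
    "points": 300,
    "flag_format": "FLAG{.*}",
    "artifacts": ["chal.py", "output.txt"],
    "service": {"port": 1337, "protocol": "tcp"},
    "docker": {"compose_file": "docker-compose.yml", "flag_path": "/flag", "flag_mode": "444"},
}

with open("metadata.json", "w") as fh:
    json.dump(challenge_meta, fh, indent=2)
```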

For live Jeopardy-IDS fusion events, per-team Kubernetes namespaces or containers ensure precise logging and state isolation, supporting high-confidence metric attribution for both benchmarking and research (Kern et al., 20 Jan 2025).

4. AI Agent Methodologies and Capabilities

AI-driven Jeopardy CTF agents now routinely surpass human teams on conventional challenge sets. Prominent systems include:

  • CAI alias1: Proprietary LLM architecture with dynamic cost-saving model switching via entropy-combined uncertainty ($\mathcal{E}_{\text{combined}}$), achieving 98% inference cost reduction for large-scale participation (Mayoral-Vilches et al., 2 Dec 2025).
  • CTF-Dojo-trained Qwen3 variants: Demonstrated that fine-tuning on only 486 execution-verified trajectories yielded a 31.9% Pass@1 success, rivalling closed-source systems—average scores: 83.5% (InterCode), 10.4% (NYU CTF), 17.5% (Cybench) (Zhuo et al., 25 Aug 2025).
  • GPT-5: Through integrated tool use, chain-of-thought reasoning, and code-debug feedback, it achieved median solve times of 20–40 minutes per senior-level challenge and outperformed 90%+ of human teams in prominent CTFs, though it requires human-in-the-loop execution and prompt engineering (Reworr et al., 6 Nov 2025).

AI agent workflows generalize as: ingest challenge → plan attack (prompting, analysis) → iterative command execution (e.g., binary analysis, cryptanalytic reduction, HTTP probing) → interpret outputs → flag extraction. Domain-specific routines (Python crypto, OpenSSL, radare2, PCAP forensic analysis) are invoked dynamically. This modular orchestration allows robust solution pipelines that scale across categories, with limited hand-written specialization (Mayoral-Vilches et al., 2 Dec 2025, Reworr et al., 6 Nov 2025).
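
A schematic of that loop, with the LLM planner abstracted behind a placeholder callable (a generic sketch, not the interface of any of the cited frameworks):

```python
import re
import subprocess

FLAG_RE = re.compile(r"FLAG\{[^}]+\}")

def solve_challenge(description: str, next_command, max_steps: int = 30) -> str | None:
    """Generic Jeopardy-CTF agent loop: plan a command, execute it, observe
    the output, and repeat until a flag-shaped string appears.

    `next_command(transcript) -> str` stands in for the LLM planner.
    """
    transcript = [f"Challenge: {description}"]
    for _ in range(max_steps):
        cmd = next_command("\n".join(transcript))           # plan
        out = subprocess.run(cmd, shell=True, capture_output=True,
                             text=True, timeout=60).stdout   # execute
        transcript += [f"$ {cmd}", out]                       # observe
        if (m := FLAG_RE.search(out)):                        # extract flag
            return m.group(0)
    return None  # step budget exhausted without a flag

# Toy usage with a hard-coded "planner" that immediately prints a fake flag
print(solve_challenge("demo", lambda _t: "echo FLAG{toy_example}"))
```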

CTFJudge extends this paradigm to include LLM-based trajectory evaluation, scoring solution processes along six critical factors, enabling nuanced feedback and diagnostic analytics (Shao et al., 5 Aug 2025).

5. Design Methodologies: Challenge Development, Scaffolding, and Analytics

Effective Jeopardy CTF design considers not only technical scaffolding but also behavioral and learning outcomes. Key design practices include:

  • Difficulty Calibration: Empirical solve counts inform point allocation—a quarter each “very easy,” “easy,” “moderate,” and “hard” in CTFTiny—ensuring balanced progression (Shao et al., 5 Aug 2025).
  • Progressive Challenge Chains: Locking later challenges until prerequisites are solved directs learners and inhibits flag sharing (Vykopal et al., 2020).
  • Hint Engineering: Time-release hints with carefully set penalties balance accessibility and challenge, with monitoring to inject backup hints as needed (Vykopal et al., 2020); a minimal scoring sketch follows this list.
  • Plagiarism Detection: Analytics dashboards flag near-simultaneous submissions, wrong-flag anomalies, and attachment download events, supporting enforcement of collaboration policies and individualized pacing (Vykopal et al., 2020).
  • Scoring Function Selection: For specialized domains (e.g., IDS evasion), custom decay mechanisms are introduced to incentivize skillful stealth beyond rote exploitation (Kern et al., 20 Jan 2025).
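
As a minimal sketch of the static-points-minus-hint-cost model used in such course deployments (the floor-at-zero rule and the point values are illustrative assumptions):

```python
def final_points(base_points: int, hints_taken: list[int]) -> int:
    """Static challenge value reduced by the cost of each hint revealed;
    clamped at zero (an assumed, common convention)."""
    return max(0, base_points - sum(hints_taken))

# A 200-point challenge where the team opened a free hint and a 50-point hint
print(final_points(200, hints_taken=[0, 50]))  # 150
```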

Platforms supporting fine-grained analytics (CTFd, custom log pipelines) facilitate tracing performance, learning correlations (e.g., ρ = 0.50 between course CTF and exam grade), and in-depth post-hoc evaluation (Vykopal et al., 2020).

6. Impact and Current Limitations in the Era of AI

The ascendancy of high-performance AI has rendered conventional Jeopardy-style CTFs a "solved game" for top models. For example, CAI’s alias1 agent achieved a 91% flag solve rate at Neurogrid CTF, beat human teams’ speed by up to 37% in mixed events, and maintained top-quartile rankings across all major circuits—mean percentile ~99% (Mayoral-Vilches et al., 2 Dec 2025). Similarly, GPT-5’s performance in elite CTFs consistently outranked 90–93% of human participants (Reworr et al., 6 Nov 2025).

Despite this, agent solutions on high-complexity and robotic tasks (e.g., CAIBench RCTF2 sub-benchmark) remain modest (22% success), reflecting persistent capability gaps in multi-step, novel, or cyber-physical security domains (Sanz-Gómez et al., 28 Oct 2025).

A pivotal observation across frameworks is the performance gap between knowledge and application: LLMs saturate static security knowledge benchmarks (70–89% success) but underperform in dynamic, multi-stage Jeopardy CTFs, especially on advanced challenges (Sanz-Gómez et al., 28 Oct 2025). This suggests fundamental limitations in current agent adaptability, contextual reasoning, and real-world exploit orchestration.

The obsolescence of the Jeopardy format as a discriminative test is increasingly recognized. When multiple AI agents approach 85–95% challenge completion, competition outcome reduces to compute allocation and bandwidth rather than conceptual security mastery. Authors advocate migration to Attack & Defense CTFs, which require live service defense, adversary adaptation, and continuous triage—domains where human resilience and adaptive reasoning remain preeminent (Mayoral-Vilches et al., 2 Dec 2025).

7. Applications and Evolving Directions

Jeopardy-style CTFs remain integral for:

  • Benchmarking LLM and AI agents on executable security tasks (CTF-Dojo, CAIBench) (Zhuo et al., 25 Aug 2025, Sanz-Gómez et al., 28 Oct 2025).
  • Offensive security agent research, enabling rapid prototyping, hyperparameter optimization, and trajectory-based analytics (CTFTiny, CTFJudge) (Shao et al., 5 Aug 2025).
  • Education and skills assessment, with demonstrated correlation to learning outcomes in undergraduate courses (Vykopal et al., 2020).
  • Security system (e.g., IDS) red-teaming and rule refinement, where the human and agent-driven discovery of evasive payloads provides actionable insights for defense engineering (Kern et al., 20 Jan 2025).

Future research aims to:

  • Shift benchmarking toward Attack & Defense formats that stress live service defense, adversary adaptation, and continuous triage, where human resilience still dominates (Mayoral-Vilches et al., 2 Dec 2025).
  • Close the gap between static security knowledge and multi-step exploit orchestration, particularly on novel and cyber-physical challenges where agent success remains low (Sanz-Gómez et al., 28 Oct 2025).
  • Extend multi-dimensional, trajectory-level evaluation (e.g., CCI, CTFJudge) to capture procedural skill beyond raw flag counts (Shao et al., 5 Aug 2025).

In sum, Jeopardy-style CTFs, while facing diminishing discriminative power in the face of advanced automation, continue to serve as a central testbed for technical assessment, agent training, and security evaluation—pending ongoing evolution toward formats that capture the adaptive, resilient, and creative aspects of real-world adversarial cybersecurity.
