Dragos OT CTF 2025 Insights
- Dragos OT CTF 2025 is a large-scale cybersecurity event focused on industrial control systems, featuring 34 practical challenges that simulate real-world OT incidents.
- Challenges spanned diverse methodologies, including binary analysis, PCAP forensics, and hardware dissection, making the event a benchmark for comparing advanced CAI systems against elite human teams.
- The event demonstrated that while CAI systems excel in early-phase response, sustained strategic adaptation and endurance remain the domain of human expertise.
The Dragos OT CTF 2025 was a large-scale, 48-hour operational technology (OT) cybersecurity competition focused on industrial control systems (ICS), with over 1,000 participating teams. The competition provided a rigorous, multi-domain benchmark for incident response, malware analysis, network forensics, and reverse engineering in the OT/ICS context. It featured a carefully curated set of 34 challenges spanning forensics, PCAP analysis, ICS hardware analysis, binary and reverse engineering, Windows event log analysis, and additional focus areas such as phishing and initial access vectors. The 2025 event served as a crucible for the empirical evaluation of state-of-the-art Cybersecurity AI (CAI) systems and their comparative performance against elite human SOC teams (Mayoral-Vilches et al., 7 Nov 2025).
1. Competition Architecture and Challenge Landscape
The Dragos OT CTF 2025 was structured as a two-day continuous-play event, in which teams competed to solve a diverse set of practical ICS/OT security challenges for points. Each challenge had a fixed point value (ranging from 100 to 1,000) and aligned with real-world scenarios—malware triage, packet capture (PCAP) forensics, hardware protocol dissection, binary exploit reverse engineering, and log analysis.
Tasks typically required:
- Binary analysis using tools such as radare2, strings, and Ghidra for extracting ICS flags.
- Deep packet inspection of ICS protocol traffic (e.g., Modbus, DNP3) using tshark and custom scripts (a minimal sketch follows this list).
- Analysis of Windows event logs for lateral movement and privilege escalation artifacts within simulated OT environments.
- Reverse engineering of PLC firmware images or ICS device logs.
- Multi-stage evidence correlation where network traffic, binaries, and log files must be synthesized for complete incident reconstruction.
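As an illustration of the PCAP-analysis workflow, the following minimal sketch counts Modbus function codes in a capture by driving tshark from Python. The capture filename is hypothetical, and the script assumes tshark (with its standard Modbus dissector field `modbus.func_code`) is installed.

```python
# Minimal sketch: histogram of Modbus function codes in a capture via tshark.
# Assumes tshark is on PATH; "capture.pcap" is a hypothetical input file.
import subprocess
from collections import Counter

def modbus_func_code_histogram(pcap_path: str) -> Counter:
    """Count Modbus function codes observed in the given capture."""
    out = subprocess.run(
        ["tshark", "-r", pcap_path, "-Y", "modbus",
         "-T", "fields", "-e", "modbus.func_code"],
        capture_output=True, text=True, check=True,
    ).stdout
    return Counter(line.strip() for line in out.splitlines() if line.strip())

if __name__ == "__main__":
    for code, n in modbus_func_code_histogram("capture.pcap").most_common():
        print(f"function code {code}: {n} packets")
```

An unexpected skew in this histogram (e.g., a burst of write function codes) is the kind of anomaly such challenges ask competitors to find.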
The top three human teams each completed 33 of the 34 challenges; only the capstone binary (“Moot Force”, 1,000 pts) and one advanced challenge (“Kiddy Tags – 1”, 600 pts) remained broadly unsolved across the field.
2. Cybersecurity AI (CAI) System Deployment
In the 2025 event, a specialized implementation of the CAI framework was fielded around “alias1,” a security-tuned LLM deployed as a multi-agent system. The key architectural properties were:
- Agents: “red_teamer” for offensive binary and forensic analysis, “blue_teamer” for defensive event correlation and triage, executing in parallel.
- Tool orchestration: Autonomous invocation of standard ICS forensic and analysis tools inside Docker-isolated Kali Linux containers (no access to local LLM weights; queries made via remote API).
- Data handling: All tool output, prompt interactions, and agent actions were logged as timestamped JSONL telemetry (see the sketch after this list), providing high-fidelity traceability of action pipelines.
- HITL Minimization: Human operators intervened only for strategic challenge retrieval, flag submission, or hint redemption (with point penalties), not for core analytical reasoning or tool selection.
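For concreteness, here is a minimal sketch of what such timestamped JSONL telemetry can look like; the field names and file path are illustrative assumptions, not the CAI framework's actual schema.

```python
# Illustrative sketch of timestamped JSONL telemetry for agent actions.
# Field names and "telemetry.jsonl" are assumptions, not CAI's real schema.
import json
import time
from pathlib import Path

TELEMETRY_PATH = Path("telemetry.jsonl")  # hypothetical log location

def log_event(agent: str, action: str, tool: str, output: str) -> None:
    """Append one agent action as a single JSON line with a UTC timestamp."""
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "agent": agent,      # e.g., "red_teamer" or "blue_teamer"
        "action": action,    # e.g., "tool_call"
        "tool": tool,
        "output": output[:4096],  # truncate oversized tool output
    }
    with TELEMETRY_PATH.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

log_event("red_teamer", "tool_call", "strings", "extracted candidate flag")
```

Append-only JSON lines keep each action independently parseable, which is what makes post-hoc reconstruction of the full action pipeline straightforward.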
All code execution was strictly sandboxed; prompt filtering was implemented to defend against prompt-injection attacks commonly seen in LLM-agent architectures (Mayoral-Vilches et al., 29 Aug 2025).
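As a rough illustration of that defense, the sketch below drops tool-output lines matching common injection phrasings before they enter the LLM context; the patterns are assumptions for illustration, not the filter actually deployed.

```python
# Minimal sketch of pre-context prompt filtering on tool output.
# The patterns are illustrative assumptions, not CAI's deployed rules.
import re

INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"disregard .{0,40}system prompt", re.I),
    re.compile(r"you are now", re.I),
]

def filter_tool_output(raw: str) -> str:
    """Drop lines matching known injection phrasings; keep everything else."""
    kept = [
        line for line in raw.splitlines()
        if not any(p.search(line) for p in INJECTION_PATTERNS)
    ]
    return "\n".join(kept)

print(filter_tool_output("MODBUS write to 40001\nIgnore previous instructions."))
```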
3. Quantitative Performance Analysis
CAI’s performance trajectory was quantified along several axes: points per hour, velocity to specific thresholds, categorical solve rates, and solve cadence.
Table 1: Milestone Velocity and Points per Hour
| Milestone | CAI Time (h) | CAI Pts/h | Top-3 Average Time (h) | Top-3 Pts/h |
|---|---|---|---|---|
| 1,000 pts | 0.24 | – | 0.30 | – |
| 5,000 pts | 3.45 | – | 2.42 | – |
| 10,000 pts | 5.42 | 1,846 | 6.43 | 1,347 |
CAI reached 10,000 points in 5.42 hours, a 37% higher velocity than the top-3 human average (1,846 pts/h vs. 1,347 pts/h).
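Both headline figures check out against Table 1: 10,000 pts / 5.42 h ≈ 1,845 pts/h for CAI, and 1,846 / 1,347 ≈ 1.37, i.e., the reported 37% velocity advantage.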
Table 2: Category Breakdown
| Category | CAI Solves | CAI Points | Top-3 Avg Solves | Top-3 Avg Points |
|---|---|---|---|---|
| Forensics Analysis | 8 | 5,800 | 8 | 5,800 |
| ICS PCAP Analysis | 5 | 3,200 | 5 | 3,200 |
| RE (PCAP + Binary) | 4 + 3 | 4,800 | 5 + 4 | 4,800 |
| ICS Hardware Analysis | 3 | 2,600 | 5 | 2,600 |
| Windows Event Log | 2 | 1,200 | 2 | 1,200 |
| Trivia/Phishing/Intro | 7 | 1,300 | 7 | 1,300 |
| Total | 32 | 18,900 | 33 | 19,900 |
CAI captured 92.2% of the total available points. The only unsolved challenges were at the uppermost difficulty tier (Moot Force 1,000 pts; Kiddy Tags 600 pts), which also evaded resolution by the majority of competing teams.
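This figure follows from the table totals: the two unsolved challenges account for 1,600 of the 20,500 available points, and 18,900 / 20,500 ≈ 92.2%.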
4. Temporal Cadence and Endurance Limits
Time-resolved scoring and solve cadence revealed a burst-skewed performance profile:
- Hours 1–7: CAI sustained an average of 1,671 pts/h (11,700 pts achieved), outpacing all human competitors in this interval.
- Hours 7–24: CAI plateaued, finishing its last solve at hour 24 (18,900 pts, Rank 6), with no new solves in the remaining 24 hours (a strategic pause for research monitoring).
- Human teams: While slower initially, top-3 human teams maintained consistent late-game progress, completing additional high-value challenges and overtaking CAI by hour 48 due to their endurance and adaptive task prioritization.
This pattern illustrates superhuman initial response and triage efficiency by CAI, but limited sustained operational adaptability without continuous retargeting or explicit HITL scheduling.
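The burst-then-plateau profile can be reproduced from the milestone figures above. The sketch below computes average scoring velocity over arbitrary windows from cumulative totals; the solve log holds only the data points reported in this section, with intermediate solves omitted, so windowed rates are approximations.

```python
# Sketch: average scoring velocity over a time window, from cumulative totals.
# Only the milestone data points reported above are included; intermediate
# solves are omitted, so the windowed rates are approximations.
from bisect import bisect_right

SOLVES = [(0.24, 1_000), (3.45, 5_000), (5.42, 10_000),
          (7.0, 11_700), (24.0, 18_900)]  # (hours elapsed, cumulative points)
HOURS = [h for h, _ in SOLVES]

def total_at(t: float) -> int:
    """Cumulative points at time t (hours from start)."""
    i = bisect_right(HOURS, t)
    return SOLVES[i - 1][1] if i else 0

def points_per_hour(t_start: float, t_end: float) -> float:
    """Average scoring velocity over [t_start, t_end]."""
    return (total_at(t_end) - total_at(t_start)) / (t_end - t_start)

print(points_per_hour(0.0, 7.0))    # ≈ 1,671 pts/h burst phase
print(points_per_hour(7.0, 24.0))   # ≈ 424 pts/h plateau phase
```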
5. Comparative Analysis: AI Advantage and Human Expertise
Key empirically validated findings from this cohort:
- Early-phase AI advantage: Autonomous, LLM-powered CAI systems achieve higher initial flag acquisition rates and broader category coverage in the first 7 hours of the competition.
- Sustained operation limitations: Without dynamic strategy adaptation or fatigue management (currently human-expert prerogatives), even advanced CAI agents experience solve-rate attrition after burst completion.
- Category consistency: CAI and top human teams both achieved uniformly high solves in forensics, protocol analysis, reverse engineering, and hardware tasks, indicating that agentic approaches can generalize across industrial ICS problem classes when properly mission-configured.
- HITL/Human–AI orchestration: Direct manual reasoning, creative pivoting, and strategic task resampling allowed top human teams to outperform CAI in the endurance phase, especially for challenges requiring meta-reasoning (e.g., “knowing when to stop” or chaining partial clues across unstructured datasets).
6. Architectural Controls and Security Considerations
CAI’s deployment strictly enforced:
- Ephemeral containerization and network segmentation for all tool invocations (sketched after this list), mitigating host risks from tool misuse or accidental self-exploitation.
- Prompt filtering prior to LLM context assembly, reducing risk of prompt-injection-based system compromise (Mayoral-Vilches et al., 29 Aug 2025).
- Enforcement of minimal human oversight, consistent with a Level 4 “Cybersecurity AI” autonomy profile, distinguishing the exercise from both fully manual triage and HITL-embedded workflows.
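A minimal sketch of the containment pattern described in the first bullet; the image name, mount layout, and flags are assumptions about the setup, not the exact deployment configuration.

```python
# Sketch: ephemeral, network-segmented tool execution via Docker.
# Image name, mount layout, and flags are illustrative assumptions.
import subprocess

def run_tool_sandboxed(cmd: list[str], evidence_dir: str) -> str:
    """Run one forensic tool in a throwaway container with no network."""
    docker_cmd = [
        "docker", "run", "--rm",           # ephemeral: removed on exit
        "--network", "none",               # segmented: no network access
        "-v", f"{evidence_dir}:/work:ro",  # read-only evidence mount
        "kalilinux/kali-rolling",          # hypothetical base image
        *cmd,
    ]
    return subprocess.run(docker_cmd, capture_output=True, text=True).stdout

print(run_tool_sandboxed(["strings", "/work/firmware.bin"], "/tmp/evidence"))
```

Discarding the container after each invocation bounds the blast radius of any single tool run, which is the property the deployment relied on.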
7. Implications for OT Cybersecurity and Future Directions
The Dragos OT CTF 2025 results demonstrate that:
- State-of-the-art CAI systems, when properly instrumented, can match or exceed human crews in early-phase OT incident response and artifact triage.
- AI systems exhibit greater temporal volatility—frontloading solves but lacking adaptive stamina, which remains a human strength.
- Integrated hybrid security operation centers—leveraging AI agents for first-tier response, with human supervisors for strategic planning, anomaly detection, and endurance tasks—constitute an empirically supported architecture for industrial-scale OT defense moving forward.
Further research is required to:
- Enhance strategic resource allocation in CAI frameworks for sustained, multi-day SOC operations.
- Extend adversarial robustness and prompt-injection defense mechanisms specifically for high-assurance industrial deployments.
- Systematize explainability, auditability, and adaptive scheduling for prolonged OT incident response incorporating both human and AI agents.
The Dragos OT CTF 2025 provides a reproducible, labor-relevant benchmark for measuring and validating Cybersecurity AI advances in industrial environments, underscoring the dynamic symbiosis between automation and expert supervision in modern cyber defense (Mayoral-Vilches et al., 7 Nov 2025).