OCCULT: Evaluating Large Language Models for Offensive Cyber Operation Capabilities (2502.15797v1)

Published 18 Feb 2025 in cs.CR and cs.AI

Abstract: The prospect of AI competing in the adversarial landscape of cyber security has long been considered one of the most impactful, challenging, and potentially dangerous applications of AI. Here, we demonstrate a new approach to assessing AI's progress towards enabling and scaling real-world offensive cyber operations (OCO) tactics in use by modern threat actors. We detail OCCULT, a lightweight operational evaluation framework that allows cyber security experts to contribute to rigorous and repeatable measurement of the plausible cyber security risks associated with any given LLM or AI employed for OCO. We also prototype and evaluate three very different OCO benchmarks for LLMs that demonstrate our approach and serve as examples for building benchmarks under the OCCULT framework. Finally, we provide preliminary evaluation results to demonstrate how this framework allows us to move beyond traditional all-or-nothing tests, such as those crafted from educational exercises like capture-the-flag environments, to contextualize our indicators and warnings in true cyber threat scenarios that present risks to modern infrastructure. We find that there has been significant recent advancement in the risks of AI being used to scale realistic cyber threats. For the first time, we find a model (DeepSeek-R1) is capable of correctly answering over 90% of challenging offensive cyber knowledge tests in our Threat Actor Competency Test for LLMs (TACTL) multiple-choice benchmarks. We also show how Meta's Llama and Mistral's Mixtral model families show marked performance improvements over earlier models against our benchmarks where LLMs act as offensive agents in MITRE's high-fidelity offensive and defensive cyber operations simulation environment, CyberLayer.

Summary

  • The paper introduces OCCULT, a structured evaluation framework designed to quantify large language models' capabilities in offensive cyber operations using realistic tasks and assessments.
  • Three benchmarks, the Threat Actor Competency Test for LLMs (TACTL), the BloodHound Equivalency Test, and the CyberLayer simulation environment, quantify these capabilities; DeepSeek-R1 correctly answers over 90% of TACTL's challenging multiple-choice questions.
  • Evaluations reveal a correlation between model size and performance and highlight trade-offs between accuracy, inference time, and operational footprint in simulated environments.

Overview

The paper "OCCULT: Evaluating LLMs for Offensive Cyber Operation Capabilities" (2502.15797) introduces a structured evaluation framework designed to rigorously quantify the potential of LLMs in facilitating offensive cyber operations (OCO). The framework, termed OCCULT, integrates operationally realistic tasks with structured assessments of a model’s OCO capabilities. The paper systematically categorizes OCO tasks, defines explicit test cases, and reports numerical benchmarks that underscore the evolving threat landscape of AI in cyber security operations.

OCCULT Framework and Methodology

The OCCULT framework is divided into several well-defined dimensions (a structural sketch of how a single test case might encode them follows the list):

  • OCO Capability Areas: These include domains directly mapped to adversarial cyber operations, explicitly linking each evaluation with cybersecurity constructs such as the MITRE ATT&CK matrix.
  • LLM Use Cases: The framework stratifies LLM applications into three distinct use cases: Knowledge Assistant, Co-Orchestration with external tools (e.g., MITRE Caldera), and Autonomous agent functionalities. Each use case encapsulates different operational expectations and degrees of autonomy.
  • Reasoning Power Assessment: The framework evaluates the reasoning capabilities of LLMs based on four key dimensions: planning sequence, environment perception, action creation/modification, and task/solution generalization. These dimensions allow for both quantitative and qualitative assessment, thereby enabling a more granular analysis of model performance.
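
To make these dimensions concrete, the following is a minimal sketch of how a single OCCULT-style test case might be encoded. The class, field names, and enumerations are illustrative assumptions made for this summary, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class UseCase(Enum):
    """The three LLM use cases stratified by the OCCULT framework."""
    KNOWLEDGE_ASSISTANT = "knowledge_assistant"
    CO_ORCHESTRATION = "co_orchestration"    # e.g., driving external tooling such as MITRE Caldera
    AUTONOMOUS_AGENT = "autonomous_agent"


@dataclass
class OccultTestCase:
    """Illustrative container for one evaluation item (all names are assumptions)."""
    case_id: str
    capability_area: str       # OCO capability area, e.g., a MITRE ATT&CK tactic
    attack_technique: str      # ATT&CK technique ID the case exercises
    use_case: UseCase
    # The four reasoning-power dimensions, each scored by an evaluator in [0, 1].
    reasoning_scores: dict = field(default_factory=lambda: {
        "planning_sequence": 0.0,
        "environment_perception": 0.0,
        "action_creation_modification": 0.0,
        "task_solution_generalization": 0.0,
    })


# Example: a co-orchestration case tied to a credential-access technique.
case = OccultTestCase(
    case_id="example-001",
    capability_area="Credential Access",
    attack_technique="T1003",
    use_case=UseCase.CO_ORCHESTRATION,
)
```

Encoding each test case this way makes it straightforward to aggregate results per capability area, per use case, or per reasoning dimension when reporting framework-level scores.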

The methodology deploys dynamic question generation (notably in the Threat Actor Competency Test for LLMs, TACTL) to circumvent memorization. This adds robustness to the evaluation metrics and ensures that assessment outcomes reflect genuine capabilities rather than rote knowledge. Synthetic data generation (as seen in the BloodHound Equivalency Test) is employed to mimic enterprise Active Directory environments, thereby grounding the evaluations in operationally relevant scenarios. Additionally, high-fidelity simulation environments (CyberLayer) are leveraged to stress-test models in scenarios with complex network topologies and adversarial conditions.
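
As an illustration of the dynamic-generation idea, a templated multiple-choice item with randomized concrete values might look like the sketch below. This is not the paper's actual TACTL generator; the template text, distractors, and host names are invented for illustration.

```python
import random

# Hypothetical question template: scenario parameters and answer order are
# re-randomized on every run, so a model cannot rely on a memorized question string.
TEMPLATE = (
    "An operator has a low-privilege foothold on {host} and wants to enumerate "
    "domain accounts while minimizing host-based alerts. "
    "Which technique is the most appropriate first step?"
)

CORRECT = "Query the domain controller over LDAP using the existing credentials"
DISTRACTORS = [
    "Dump credential material from memory immediately",
    "Run a full TCP port scan of the subnet",
    "Disable the local endpoint protection service",
]


def generate_item(rng: random.Random) -> dict:
    """Instantiate one multiple-choice item with a randomized host and shuffled options."""
    host = rng.choice(["WKSTN-0421", "SRV-FILE-02", "HR-LAPTOP-17"])
    options = [CORRECT] + DISTRACTORS
    rng.shuffle(options)
    return {
        "question": TEMPLATE.format(host=host),
        "options": options,
        "answer_index": options.index(CORRECT),
    }


item = generate_item(random.Random(7))
print(item["question"])
for i, option in enumerate(item["options"]):
    print(f"  {chr(65 + i)}. {option}")
```

Because the concrete values and answer ordering change per instantiation, repeated evaluations measure whether a model reasons about the scenario rather than recalling a fixed answer key.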

Evaluation Benchmarks

The paper details three principal benchmark scenarios, each designed to expose different facets of an LLM’s OCO capabilities:

  • Threat Actor Competency Test for LLMs (TACTL):
    • Comprising multiple-choice assessments based on real-world OCO scenarios, TACTL evaluates models on a spectrum of 44 ATT&CK techniques.
    • Numerical results are notable: for example, the DeepSeek-R1 model reliably scored above 90% accuracy across challenging questions that extend beyond conventional educational exercises.
  • BloodHound Equivalency Test:
    • This test assesses an LLM’s proficiency in synthesizing and analyzing Active Directory structures, gauging its capacity to identify lateral-movement and attack paths (a toy illustration follows this list). Performance was strong on attribute-identification tasks, while more nuanced relational queries proved substantially harder and separated the models more clearly.
  • CyberLayer Simulation Environment:
    • In this simulation, the measured performance encompassed goal completion efficiency, stealth (artifact minimization), and adaptability to varying network topologies.
    • The test demonstrates that some LLMs, exemplified by Meta Llama 3.1 70B, exhibit consistent and efficient performance when navigating high-fidelity attack scenarios, with a clear trade-off observed between accuracy and inference latency.
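
A rough illustration of what the BloodHound Equivalency Test asks of a model is sketched below: given a synthetic Active Directory graph, identify a lateral-movement chain to a high-value target. The tiny graph, the edge labels, and the use of networkx are assumptions made for this summary; the paper's synthetic environments and queries are considerably richer.

```python
import networkx as nx

# Toy synthetic Active Directory graph: nodes are principals and machines,
# edges are BloodHound-style relationships an attacker could traverse.
ad = nx.DiGraph()
ad.add_edge("USER:alice", "GROUP:helpdesk", rel="MemberOf")
ad.add_edge("GROUP:helpdesk", "COMPUTER:WKSTN-07", rel="AdminTo")
ad.add_edge("COMPUTER:WKSTN-07", "USER:svc_backup", rel="HasSession")
ad.add_edge("USER:svc_backup", "GROUP:domain_admins", rel="MemberOf")

# The benchmark asks, in effect: starting from a compromised principal, what is
# the shortest chain of relationships that reaches Domain Admins?
path = nx.shortest_path(ad, "USER:alice", "GROUP:domain_admins")
for src, dst in zip(path, path[1:]):
    print(f"{src} -[{ad.edges[src, dst]['rel']}]-> {dst}")
```

An LLM that can reproduce this kind of path reasoning directly from a textual description of the environment is, in effect, matching what the BloodHound tooling computes programmatically, which is the equivalency the benchmark probes.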

Key Findings and Numerical Results

The experimental analysis demonstrates several critical outcomes:

  • High-Performance Outlier: The DeepSeek-R1 model attained over 90% correctness on TACTL benchmarks, marking a significant numerical milestone in offensive cyber operation tasks. This represents the first instance of such high performance under realistic and dynamically generated test conditions.
  • Model Comparisons Across Use Cases:
    • The paper shows a correlation between model size (i.e., parameter count) and performance in OCO tasks. Meta’s Llama and Mistral’s Mixtral families exhibit marked improvements when benchmarked against earlier generation models.
    • Models deployed in the CyberLayer simulation display variability in efficiency: while some models achieve rapid goal completion with reduced detectable artifacts, others exhibit a trade-off where a marginal increase in accuracy results in increased operational footprint.
  • Trade-Off Analysis: The evaluations signal an inherent trade-off between inference time and accuracy, especially in the high-fidelity cyber simulation environment (a toy comparison follows this list). This finding matters for real-world deployment, where latency is a critical operational factor.
  • Qualitative Reasoning Assessment: Beyond numerical performance, the framework’s approach to assessing reasoning power (through planning sequences and environmental perception) provides a richer characterization of model capability. This is particularly relevant when considering LLM integration into systems that require autonomous operational decision-making.
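
The accuracy/latency/footprint trade-off can be made concrete with a toy comparison like the one below. The numbers, field names, and the composite score are invented for illustration only; they are not the paper's results or its scoring method.

```python
from dataclasses import dataclass


@dataclass
class EpisodeStats:
    """Per-model aggregates over simulated episodes (all values illustrative)."""
    model: str
    goal_completion_rate: float   # fraction of episodes where the objective was met
    mean_latency_s: float         # mean wall-clock seconds per model decision
    mean_artifacts: float         # mean detectable artifacts left per episode


def composite_score(s: EpisodeStats, latency_weight: float = 0.02,
                    artifact_weight: float = 0.01) -> float:
    """Hypothetical scalar that rewards goal completion and penalizes slowness and noise."""
    return (s.goal_completion_rate
            - latency_weight * s.mean_latency_s
            - artifact_weight * s.mean_artifacts)


runs = [
    EpisodeStats("model-a", goal_completion_rate=0.82, mean_latency_s=3.1, mean_artifacts=14.0),
    EpisodeStats("model-b", goal_completion_rate=0.88, mean_latency_s=9.4, mean_artifacts=22.0),
]

for s in sorted(runs, key=composite_score, reverse=True):
    print(f"{s.model}: composite score = {composite_score(s):.3f}")
```

Under this toy weighting, the slightly more accurate model can rank lower once its latency and artifact footprint are priced in, which is exactly the kind of trade-off the CyberLayer results surface.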

Practical and Operational Implications

From an applied perspective, the OCCULT framework provides a rigorous and repeatable methodology for both offensive and defensive cybersecurity evaluations:

  • Operational Deployment Considerations:
    • Real-world adversarial scenarios necessitate not only high accuracy but also low-latency responses. The paper’s insights into the trade-offs between performance and inference time are critical for developing systems where speed and subtlety in operation are paramount.
    • The dynamic test case generation and synthetic data techniques could be adapted to continuously benchmark deployed systems and simulate emerging threat vectors.
  • Benchmarking in an Evolving Threat Landscape:
    • Given the numerical results and structured evaluation, practitioners can derive risk assessments concerning the scalability of OCO tasks performed by LLMs.
    • The framework’s adaptability to diverse OCO scenarios (from passive knowledge recall to active cyber operations) makes it both a diagnostic tool and a predictive measure for emerging AI-enabled threats.
  • Scaling and Integration:
    • In large-scale cybersecurity operations, integrating high-performing LLMs like DeepSeek-R1 (or those with similar operational profiles) into defensive mechanisms will require balancing computational overhead with rapid decision-making capabilities.
    • Deployment strategies might involve a hybrid approach where LLMs are used primarily as advisory systems or co-orchestrated agents, supplemented by deterministic algorithms for faster, high-stakes decision loops, as in the sketch below.
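
One way to realize such a hybrid deployment is sketched below: deterministic rules handle time-critical, well-understood events, and the LLM is consulted only as an advisor for ambiguous cases. The event fields and the query_llm_advisor function are placeholders assumed for illustration, not a real API or the paper's architecture.

```python
def deterministic_policy(event: dict):
    """Fast, rule-based handling for well-understood, high-stakes events."""
    if event.get("severity") == "critical" and event.get("signature_match"):
        return "isolate_host"   # known-bad pattern: act immediately, no LLM in the loop
    if event.get("severity") == "low":
        return "log_only"
    return None                 # ambiguous: defer to the advisory path


def query_llm_advisor(event: dict) -> str:
    """Placeholder for an LLM call that returns a *recommended*, not executed, action."""
    # A real system would call a model endpoint with guardrails and route the
    # suggestion to a human analyst or an orchestrator for review.
    return "recommend_investigation"


def handle_event(event: dict) -> str:
    action = deterministic_policy(event)
    return action if action is not None else query_llm_advisor(event)


print(handle_event({"severity": "critical", "signature_match": True}))  # -> isolate_host
print(handle_event({"severity": "medium"}))                             # -> recommend_investigation
```

Keeping the deterministic path authoritative for high-stakes decisions bounds worst-case latency, while the advisory path lets the LLM's broader reasoning inform the slower, human-reviewed loop.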

Conclusion

The comprehensive evaluation presented in "OCCULT: Evaluating LLMs for Offensive Cyber Operation Capabilities" offers a methodologically robust and operationally grounded approach to assessing the potential of AI-enabled cyber threats. The framework not only benchmarks current capabilities with strong numerical results (e.g., 90%+ performance on TACTL) but also introduces a scalable evaluation paradigm built on dynamic test generation, realistic simulation environments, and explicit assessment of reasoning capabilities. These contributions help practitioners understand the double-edged nature of LLM advances in the context of offensive cyber operations, providing both quantitative and qualitative measures that inform risk assessment, deployment, and defensive strategy.
