TACTL: Threat Actor Competency Test for LLMs
- TACTL is an integrated benchmarking methodology that evaluates LLMs' offensive cybersecurity capabilities across realistic, emulated enterprise environments.
- It employs detailed metrics such as success rate, mean time to compromise, and vulnerability coverage to provide statistically robust performance assessments.
- TACTL integrates agent-based penetration testing, cognitive inference, and threat actor attribution to mirror real-world cybersecurity challenges.
The Threat Actor Competency Test for LLMs (TACTL) is an integrated benchmarking methodology designed to rigorously evaluate the offensive cybersecurity and attribution capabilities of LLMs across a spectrum of threat scenarios. Drawing on the recommendations from contemporary research and empirical protocols, TACTL provides a reproducible, statistically robust, and practitioner-aligned assessment spanning agent-based penetration testing, cyber threat intelligence (CTI) analysis, real-world exploit reasoning, and threat-actor attribution (Happe et al., 14 Apr 2025, Guru et al., 15 May 2025, Moskal et al., 2023, Kouremetis et al., 18 Feb 2025, Hans et al., 23 Oct 2025, Munshi et al., 16 May 2025, Alam et al., 3 Nov 2025).
1. Testbed Design and Scenario Engineering
TACTL prescribes a multi-layered testbed that emulates authentic enterprise environments, ensuring both realism and repeatability. The canonical testbed comprises:
- Network Topology: Segmented architecture featuring a DMZ with public-facing services (e.g., HTTP, SSH), an internal LAN with a Windows Active Directory forest (domain controller, file server, heterogeneous workstations), an isolated development VLAN hosting Linux databases and CI/CD hosts, and a jump host enforcing trust boundaries.
- Configuration Hygiene: Parameterized host naming and credential schemes (e.g., host-Aₖ, passₖ) per run to preclude information leakage from LLM training. Background traffic and ephemeral user accounts inject nondeterminism.
- Vulnerability Profile: A curated set of ~30–50 exploits tasks, balancing CVE-based vulnerabilities (web, kernel, protocol misconfigurations), misconfigurations (weak keys, default credentials), and contextual CTF-style puzzles (reverse engineering only as practitioner-justified). All cases are fully documented with CVE ID, software/version, expected kill-chain, and pre-requisite recon (Happe et al., 14 Apr 2025).
Scenario complexity spans flat to pivoted topologies, enforces defensive simulation mechanisms (e.g., IDS alerting, honeypots), and supports dynamic variable substitution in benchmarks to prevent answer memorization (Kouremetis et al., 18 Feb 2025).
2. Metrics and Quantitative Evaluation
TACTL quantifies model performance via orthogonal, mathematically defined metrics:
- Success Rate (SR):
where for the outcome of test case in run .
- Mean Time to Compromise (MTC):
where is the elapsed time for successful exploits.
- Vulnerability Coverage (COV):
Each denotes exploitation of vulnerability .
- Subtask Progression Rate for fine-grained workflows,
with subtasks per case, completed per run.
- Statistical Validity: Task grouping by attack type (Web/Windows/Linux/Network) and difficulty (easy–hard); independent runs per LLM; controls for max steps and execution time; significance assessed via paired t-test/Wilcoxon for SR, Mann–Whitney U for MTC, reporting effect sizes (Happe et al., 14 Apr 2025).
- Baseline Comparisons:
- Human expert transcripts (SR, MTC),
- Automated tools (e.g., Metasploit, ZAP),
- Standard LLM baselines with recorded prompts and versioning—all executed under matched task constraints.
- Cost-Efficiency: Token usage tracked per run to correlate computational expenditure with SR/MTC (Happe et al., 14 Apr 2025).
3. Task Modalities and Agent Architectures
TACTL encompasses multiple task modalities, reflecting the breadth of real-world cyber operations:
Agent-based Penetration Testing: LLMs function as autonomous agents, orchestrated via a prompt-chaining architecture—plan-act-report. The loop involves: - Plan/tactic selection: LLM determines next stage (RECON, EXPLOIT, EXFILTRATE). - Act/execution: LLM outputs executable commands, leveraging tools like Kali Linux, Nmap, Metasploit. - Report/evaluation: LLM summarizes command outcomes (SUCCESS/FAIL), feeding back into the planning stage (Moskal et al., 2023).
Stage-specific metrics (e.g., ) and decision accuracy are aggregated:
Competency is defined by meeting threshold rates, e.g., (Moskal et al., 2023).
Offensive Cyber Operations (OCO) Reasoning: Multiple-choice benchmarks derived from real-world scenarios and tagged to MITRE ATT&CK tactics/techniques rigorously probe the LLM’s environment perception and action selection capabilities. Dynamic variable instantiation and scenario-driven options preclude answer memorization (Kouremetis et al., 18 Feb 2025).
Log-to-ATT&CK Mapping and Cognitive Inference: TACTL integrates modules for parsing IDS logs, segmenting into behavioral phases, mapping to ATT&CK tactics/techniques (via RAG-LLM), and inferring cognitive traits (e.g., loss aversion) from phase transitions (Hans et al., 23 Oct 2025). Outputs are formally defined (LaTeX-specified schemas) and scored against ground-truth with 70% weight on ATT&CK mapping, 30% on cognitive inference.
Threat Actor Attribution and CTI: Workflow includes structured TTP extraction from unstructured reports (LLM- or embedding-based), probabilistic actor-ranking using normalized TTP frequency matrices, and benchmarking attribution accuracy against random and expert baselines (average rank, Jaccard similarity, precision/recall) (Guru et al., 15 May 2025, Alam et al., 3 Nov 2025).
Cloud Threat Modeling: Modular tasks—asset identification, threat enumeration (mapped to STRIDE/ATT&CK/OWASP), attack-path analysis, and mitigation recommendation—are benchmarked on cloud-native infrastructure, leveraging production-scale datasets (e.g., ACSE-Eval’s 100 AWS scenarios) (Munshi et al., 16 May 2025).
4. Qualitative and Error Analysis
TACTL protocol mandates in-depth qualitative assessment to contextualize quantitative scores:
- Failure Taxonomy: Manual review of 10–15% of agent runs, with failures categorized as reconnaissance omission, exploit misuse, logical dead-ends, or environment misunderstanding.
- Attack Graph Tracing: Command sequences are mapped against predefined kill-chains to discover bottlenecks or recurrent failures.
- Narrative Walk-throughs: Annotated transcripts from representative and edge-case runs expose reasoning errors, prompt/response loop pathologies, and recovery behaviors.
- Error Modes in Attribution: Models exhibit superficial keyword matching, overgeneralization across TTPs, insufficient negative reasoning, and limited temporal causal reasoning (Alam et al., 3 Nov 2025, Guru et al., 15 May 2025).
- Post-processing: Deprecated MITRE IDs and hallucinated codes are filtered; prompts are stringently engineered to constrain LLM outputs to valid taxonomies (Guru et al., 15 May 2025, Happe et al., 14 Apr 2025).
5. Dataset Construction and Task Pipelines
TACTL benchmarks draw from dynamically assembled and deduplicated CTI corpora:
- Data Sources: MITRE ATT&CK STIX bundles, NVD (CVE/CWE/CVSS), public APT reports (Malpedia, vendor advisories), operational logs (DARPA TC, synthetic scenarios), production-grade IaC/architecture stacks (Cloud Security Engineering Eval) (Alam et al., 3 Nov 2025, Munshi et al., 16 May 2025).
- Anonymization: Threat-actor attribution data is anonymized (“they”) and strips explicit actor/entity mentions, ensuring no label leakage for abductive inference.
- Task Decomposition: Scenarios are decomposed into granular tasks: MCQs, root-cause mapping, attack technique extraction, risk mitigation recommendation, subtask progression, and scenario-specific attribution (Alam et al., 3 Nov 2025).
6. Evaluation Methodology and Scoring
TACTL employs standardized, task-appropriate metrics:
- Classification:
- Accuracy for single-label tasks.
- Precision, recall, F1 for multi-label (e.g., mitigations) and open-ended mappings.
- Cosine similarity for embedding alignment.
- Jaccard similarity for TTP set overlap.
- Ranking and Explanation:
- Average rank of true threat actors in attribution ( denotes above-random).
- “Explanation fidelity” via token-level precision-recall on rationale segments (Alam et al., 3 Nov 2025).
- Composite Indices:
- Multi-metric scores (accuracy, F1, explanation bonus) are linearly combined.
- Bias weighting for cognitive tests: ATT&CK mapping, bias F1 (Hans et al., 23 Oct 2025).
Cross-validation, random seeds for repeatability, and expert review ensure methodological rigor.
7. Best Practices and Future Extensions
TACTL is designed for extensibility and adaptation:
- Prompt engineering: Explicit response schemas, constraint-based instructions, and line-based chunking for extraction reduce noise and hallucination (Guru et al., 15 May 2025).
- Retrieval-Augmented Evaluation: HyDE-style technique definition augmentations, indexable prior histories, and live external knowledge retrieval improve recall and reduce omissions.
- Dynamic/Real-time Scenarios: Continuous ingestion of new APT/CTI reports, “real-time” IaC drift, adaptive threat-model updating, and agent-in-the-loop simulation (e.g., CyberLayer).
- Attack Path Coverage & Remediation: Advanced metrics such as Attack Path Coverage (fraction of adversary steps detected) and Remediation Feasibility (implementability of defensive recommendations) (Munshi et al., 16 May 2025).
- Community and Toolkit Release: Open-sourced question/scenario databases, Python scoring libraries, dashboard visualizations, and cross-benchmark validation suites are recommended for reproducibility (Kouremetis et al., 18 Feb 2025, Munshi et al., 16 May 2025, Hans et al., 23 Oct 2025).
TACTL thereby systematizes the measurement of LLM emergent cyber-offensive capabilities, attribution skill, and high-level behavioral reasoning, aligning benchmark design with both research and practical security assessment imperatives.