Execution-Grounded Security Benchmark
- Execution-grounded security benchmarks are dynamic evaluation frameworks that assess system security by executing code in real, instrumented environments.
- They integrate diverse test suites, adversarial inputs, and detailed performance metrics to measure vulnerabilities and functional resilience.
- These benchmarks enable comparative analysis and rapid identification of security gaps in modern systems, AI agents, and protocol-driven platforms.
An execution-grounded security benchmark is a systematic evaluation framework in which candidate systems, code, or agents are run within real, instrumented environments to capture their security-relevant behavior under live operational conditions. Unlike purely static, logic-based, or simulation-only benchmarks, execution-grounded methodologies measure vulnerabilities, resilience, and correctness by directly observing outcomes during actual execution, ranging from operating-system calls in TEEs and code-agent actions in Docker sandboxes to full agent-environment trajectories in financial sandboxes, LLM tool-use stacks, and interactive web GUIs. Execution-grounded security benchmarks are now central to comparative evaluation and the analysis of stateful security properties in modern code, AI agents, and protocol-aware platform deployments.
1. Core Principles of Execution-Grounded Security Benchmarks
The defining property of an execution-grounded benchmark is the use of concrete, automatable test suites whose security metrics derive from direct execution traces within operationally realistic environments. Security outcomes are measured as the observable results of actual system instantiation, not from static analysis or logic modeling.
A paradigmatic example is SGXGauge, which measures Trusted Execution Environment (TEE) security by running real-world workloads inside Intel SGX enclaves and collecting low-level counters from Linux perf and the SGX driver (e.g., dTLB misses, EPC evictions) (Kumar et al., 2022). Similarly, network security benchmarks such as NetBASILISK assess transfer integrity and overhead in live multi-hop WAN scenarios using authentic data flows and monitoring frameworks (Guhit et al., 2021). For code security, AutoBaxBuilder and DUALGAUGE execute generated programs in sandboxed containers, running adversarial and functional test suites to verify correctness and exploitability independently (Arx et al., 24 Dec 2025, Pathak et al., 24 Nov 2025).
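To make this instrumentation style concrete, the following minimal sketch (not the SGXGauge harness itself; the workload command and event list are illustrative placeholders) runs a workload under Linux `perf stat` and parses the resulting counters. It assumes a Linux host with perf installed and permission to read the chosen events.

```python
import subprocess

# Hypothetical workload; SGXGauge runs real enclave workloads instead.
WORKLOAD = ["python3", "-c", "print(sum(range(10**7)))"]
EVENTS = ["dTLB-load-misses", "page-faults", "cycles"]

def run_with_perf(cmd, events):
    """Run `cmd` under `perf stat` and return {event: count}."""
    perf_cmd = ["perf", "stat", "-x", ",", "-e", ",".join(events), "--"] + cmd
    proc = subprocess.run(perf_cmd, capture_output=True, text=True, check=True)
    counters = {}
    # With -x, perf emits CSV records on stderr: value,unit,event,...
    for line in proc.stderr.splitlines():
        fields = line.split(",")
        if len(fields) >= 3 and fields[2] in events:
            try:
                counters[fields[2]] = int(fields[0])
            except ValueError:
                counters[fields[2]] = None  # "<not counted>" / "<not supported>"
    return counters

if __name__ == "__main__":
    print(run_with_perf(WORKLOAD, EVENTS))
```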
Execution-grounded approaches are essential in domains—such as financial agents (Yang et al., 9 Jan 2026), LLM tool-use and MCP protocols (Zhang et al., 14 Oct 2025), risky code agents (Guo et al., 2024), and LVLM-based web agents (Ying et al., 11 Oct 2025)—where semantic, protocol, privilege, or context-based exploits often bypass defenses that rely solely on static analysis.
2. Benchmark Construction and Methodological Taxonomy
Execution-grounded benchmarks are typically constructed in several steps (a minimal end-to-end harness sketch follows the list):
- Task Definition: Select or synthesize a suite of security-relevant tasks (e.g., transaction approval, API tool invocation, binary data-flow).
- Test Suite Generation: Develop functional, adversarial, and edge-case inputs; in advanced setups, this includes diverse input modalities (natural language, code, protocol, GUI) and exhaustive coverage of security-relevant behaviors.
- Instrumentation & Environment Setup: Prepare reproducible, controlled execution grounds—such as Docker containers with resource isolation (Guo et al., 2024, Pathak et al., 24 Nov 2025), snapshot-isolated forked EVM chains, regulated SQL databases with stateful compliance constraints (Yang et al., 9 Jan 2026), or simulated network segments (Guhit et al., 2021).
- Execution and Trace Collection: Run candidate systems or agents on the tasks; collect execution traces, outputs, and observable state transitions via instrumentation (perf, audit logs, code coverage, agent action logs, error counters).
- Metric Computation: Quantify outcomes using well-defined security metrics; examples include Attack Success Rate (ASR), Task Completion Rate (TCR), Net Resilient Performance (NRP), Overhead percent, Vulnerability Detection Rate (VDR), F1-score, and step-efficiency decay.
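The pipeline above can be illustrated with a minimal harness sketch. It makes simplifying assumptions: the system under test is abstracted as a plain Python callable, and the task inputs and pass/violation oracles are hypothetical placeholders rather than any specific benchmark's suite.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# The system under test is abstracted as a callable mapping a task input to
# an observable output; real benchmarks execute agents or programs inside
# instrumented sandboxes instead.
Candidate = Callable[[str], str]

@dataclass
class Task:
    name: str
    functional_inputs: List[str]          # inputs the system should handle
    adversarial_inputs: List[str]         # inputs crafted to trigger a violation
    passes: Callable[[str, str], bool]    # oracle: did execution meet the spec?
    violates: Callable[[str, str], bool]  # oracle: did execution break policy?

@dataclass
class BenchmarkResult:
    tcr: float                            # Task Completion Rate (functional inputs)
    asr: float                            # Attack Success Rate (adversarial inputs)
    traces: List[Dict] = field(default_factory=list)

def run_benchmark(candidate: Candidate, tasks: List[Task]) -> BenchmarkResult:
    traces, completed, n_func, violated, n_adv = [], 0, 0, 0, 0
    for task in tasks:
        for x in task.functional_inputs:
            out = candidate(x)
            ok = task.passes(x, out)
            completed, n_func = completed + ok, n_func + 1
            traces.append({"task": task.name, "input": x, "output": out, "pass": ok})
        for x in task.adversarial_inputs:
            out = candidate(x)
            bad = task.violates(x, out)
            violated, n_adv = violated + bad, n_adv + 1
            traces.append({"task": task.name, "input": x, "output": out, "violation": bad})
    return BenchmarkResult(tcr=completed / max(n_func, 1),
                           asr=violated / max(n_adv, 1), traces=traces)
```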
The following table summarizes selected frameworks and their domains:
| Framework | Environment Instrumentation | Key Security Metrics |
|---|---|---|
| SGXGauge | Linux perf + SGX driver in enclave | EPC miss rate, TLB/page-walk overhead |
| NetBASILISK | XRootD, perfSONAR, Humio pipeline | Transfer throughput, error rate, CPU |
| AutoBaxBuilder | Docker sandbox, exploit scripts | VDR, pass@1, sec_pass@1 |
| DUALGAUGE | Agentic executor + LLM evaluator | pass@k, secure-pass@k, PR, SPR |
| MSB (MCP Security Bench) | Real MCP tool servers, dynamic harness | ASR, NRP, task completion |
| FinVault | Regulatory sandbox, state database | ASR, TPR, FPR, scenario compromise |
| RedCode | Multiformat tasks in Docker | Rejection rate, ASR, VC |
| SecureWebArena | Browser GUIs, SoM markup | RVR, BCR, PDR (reasoning/behavior/outcome) |
| CIRCLE | Code interpreter sandboxes | Refusal rate, Fulfilled rate, Timeout |
3. Instrumentation and Execution Environments
Instrumentation depth is critical in execution-grounded security evaluation:
- TEE Benchmarks: Use both OS-level and driver-level counters (dTLB, page-walk, EPC events, ECALL/OCALL counts) to analyze the true performance-security trade-off and overhead profile across hardware and library OS settings (Kumar et al., 2022). VM-based TEEs extend this with full memory and CPU-state encryption, measuring I/O and memory overhead at VM boundaries (Coppolino et al., 2024).
- Agentic and Code Generation Benchmarks: Employ Docker images or similar sandboxes, pre-populated with adversarial resources, and agentic program executors that automate compilation, dependency resolution, and runtime patching (Pathak et al., 24 Nov 2025); a minimal sandbox sketch follows this list. Evaluation harnesses for code interpreters (CIRCLE) enforce resource bounds, probe CPU/memory/disk exhaustion, and apply LLM judges to reason about security (Chua, 25 Jul 2025).
- Protocol/Tool-Use Benchmarks: Real-world tool invocations are a necessity for RAS-Eval and MSB; synthetic simulations miss critical errors and vulnerabilities that emerge only in authentic TCP/API environments (Fu et al., 18 Jun 2025, Zhang et al., 14 Oct 2025).
- Web and GUI Agent Benchmarks: SecureWebArena uses fully instrumented browser environments (HTML/CSS/DOM + SoM markup), capturing chain-of-thought traces, agent actions, and outcome state for every trajectory (Ying et al., 11 Oct 2025).
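As an illustration of sandboxed execution with resource bounds, the sketch below launches untrusted code in a throwaway Docker container with networking disabled and CPU/memory caps. It assumes a local Docker daemon and the `python:3.11-slim` image; the flags and limits are representative choices, not the configuration of any particular framework.

```python
import subprocess

def run_in_sandbox(code: str, image: str = "python:3.11-slim",
                   mem: str = "256m", cpus: str = "0.5", timeout: int = 30):
    """Execute untrusted Python code in a throwaway, network-less Docker
    container with bounded memory/CPU, returning an execution trace."""
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",      # no network access
        "--memory", mem,          # hard memory cap
        "--cpus", cpus,           # CPU quota
        "--pids-limit", "64",     # blunt fork bombs
        "--read-only",            # immutable root filesystem
        image, "python", "-c", code,
    ]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return {"exit_code": proc.returncode, "stdout": proc.stdout,
                "stderr": proc.stderr, "timed_out": False}
    except subprocess.TimeoutExpired:
        return {"exit_code": None, "stdout": "", "stderr": "", "timed_out": True}

# Example: a memory-exhaustion probe in the spirit of CIRCLE's checks;
# the allocation should be killed by the 256 MB cap (exit code 137).
print(run_in_sandbox("x = 'a' * (10**9)"))
```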
4. Metric Design and Security Evaluation Criteria
Metrics are tailored to the environment and threat model; a small computation sketch follows the list:
- Attack Success Rate (ASR): Fraction of adversarial trials in which a security policy is violated (e.g., FinVault's formal definition $\mathrm{ASR} = N_{\text{violations}} / N_{\text{adversarial trials}}$) (Yang et al., 9 Jan 2026).
- Task Completion Rate (TCR): Measures functional outcomes under attack (e.g., agent's tool sequence matches human reference) (Fu et al., 18 Jun 2025).
- Net Resilient Performance (NRP): A single score that balances security and functionality, rewarding task completion while penalizing successful attacks (Zhang et al., 14 Oct 2025).
- VDR, FPR, PR, SPR: Vulnerability detection and false-positive rates for code and data-flow analysis, along with pass-rate (PR) and secure-pass-rate (SPR) aggregates for generated code (Weideman et al., 30 May 2025, Pathak et al., 24 Nov 2025, Arx et al., 24 Dec 2025).
- Reasoning/Behavior/Outcome Rates (RVR, BCR, PDR): SecureWebArena decomposes vulnerabilities into reasoning, behavioral, and outcome-stage failures (Ying et al., 11 Oct 2025).
- Resource Overheads (TEE-specific): Measurement of additional cycles, memory, or latency due to security boundaries (Kumar et al., 2022, Coppolino et al., 2024).
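A small computation sketch for the generic metrics above follows. The trial schema is hypothetical, and the NRP expression shown (task completion discounted by attack success) is one plausible formulation rather than the exact definition used in MSB.

```python
from typing import Dict, List

def security_metrics(trials: List[Dict]) -> Dict[str, float]:
    """Compute generic benchmark scores from per-trial records.

    Each trial is a dict with boolean fields:
      adversarial - was the input an attack?
      completed   - did the system finish the task correctly?
      violated    - was a security policy violated during execution?
    NRP below is one plausible formulation (completion discounted by attack
    success); individual benchmarks define their own exact combination.
    """
    adversarial = [t for t in trials if t["adversarial"]]
    asr = sum(t["violated"] for t in adversarial) / max(len(adversarial), 1)
    tcr = sum(t["completed"] for t in trials) / max(len(trials), 1)
    return {"ASR": asr, "TCR": tcr, "NRP": tcr * (1.0 - asr)}

# Toy run: two benign trials, two adversarial trials, one successful attack.
trials = [
    {"adversarial": False, "completed": True,  "violated": False},
    {"adversarial": False, "completed": True,  "violated": False},
    {"adversarial": True,  "completed": True,  "violated": False},
    {"adversarial": True,  "completed": False, "violated": True},
]
print(security_metrics(trials))  # {'ASR': 0.5, 'TCR': 0.75, 'NRP': 0.375}
```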
These metrics support a fine-grained failure taxonomy and scaling analysis, and facilitate root-cause diagnosis in risk analytics.
5. Empirical Insights and Comparative Analysis
Execution-grounded benchmarks consistently reveal asymmetries in agent/model performance:
- Security–Functionality Trade-off: For code agents and LLMs, joint security–correctness benchmarking (DUALGAUGE, AutoBaxBuilder) demonstrates that high functional pass rates can collapse once the secure-pass criterion is applied under targeted adversarial scenarios, highlighting the non-linear impact of security constraints (Pathak et al., 24 Nov 2025, Arx et al., 24 Dec 2025); see the pass@k sketch after this list.
- Inverse Scaling Effects: Advanced models exhibit higher utility but increased vulnerability rates when tool-use protocols are weaponized, as observed in MCP Security Bench (MSB) (Zhang et al., 14 Oct 2025) and RAS-Eval (Fu et al., 18 Jun 2025).
- Contextual and Semantic Attack Efficacy: In FinVault, semantic, role-based, or multi-turn prompt-based attacks bypass syntactic defenses at much higher rates; protocol and context injection vectors remain potent (Yang et al., 9 Jan 2026).
- Domain-Specific Failure Modes: Insurance and compliance scenarios in FinVault are most vulnerable; code agents operating over file system and OS actions show higher rejection rates, whereas logical bugs are rarely flagged (Guo et al., 2024). LVLM web agents fail systematically on visually deceptive UI attacks despite reasoning safeguards (Ying et al., 11 Oct 2025).
- Mitigation Gaps: Interpreter-level resource exhaustion is insufficiently handled by current API platforms (CIRCLE); even the most robust proactive guard models still admit nonzero attack success rates, and at high operational cost (Chua, 25 Jul 2025, Yang et al., 9 Jan 2026).
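The collapse from functional to secure pass rates can be quantified with the standard unbiased pass@k estimator, counting a generation toward secure-pass@k only if it passes both the functional and the exploit test suites. The sketch below illustrates the general idea; the sample counts are invented and the scoring procedure is not necessarily DUALGAUGE's exact method.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn without replacement from n generations is a success, given that
    c of the n generations are successes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Joint scoring: a generation counts toward secure-pass@k only if it passes
# both the functional test suite and the security (exploit) test suite.
n = 10             # generations per task (invented numbers)
functional_ok = 7  # generations passing the functional suite
also_secure = 3    # of those, generations also withstanding the exploit suite
print("pass@1        =", pass_at_k(n, functional_ok, 1))  # 0.7
print("secure-pass@1 =", pass_at_k(n, also_secure, 1))    # 0.3
```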
6. Design Recommendations and Best Practices
Recurring guidelines for benchmark robustness and extensibility include:
- Multi-modal in-environment safety instrumentation: Integration of protocol-level, semantic, and audit-based detectors in native execution layers (FinVault) (Yang et al., 9 Jan 2026).
- Coverage-driven task and test construction: Benchmarks must enforce risk-based or specification-complete coverage via adversarial augmentation, variant generation, and cross-model validation (Pathak et al., 24 Nov 2025, Arx et al., 24 Dec 2025).
- Automated trace aggregation and labeling: Pipelines must support deterministic collection and script-driven evaluation, minimizing the role of LLM judges except where semantic assessment is unavoidable (DUALGAUGE, SecureWebArena) (Ying et al., 11 Oct 2025).
- Extensibility to new threat models: APIs for adding new CWE classes, attack vectors, languages, and domains, as in AutoBaxBuilder's task generation pipeline (Arx et al., 24 Dec 2025); a minimal plug-in registry sketch follows this list.
- Reference implementations and open data: Releasing instrumentation, code, and datasets under reproducible frameworks accelerates community hardening (RAS-Eval, RedCode, FinVault) (Fu et al., 18 Jun 2025, Guo et al., 2024, Yang et al., 9 Jan 2026).
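Extensibility of this kind is often realized as a plug-in registry. The sketch below is a hypothetical minimal design, not AutoBaxBuilder's actual API: new attack vectors (keyed by CWE class) register an input generator, and the execution/metric pipeline can iterate the registry without modification.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class AttackVector:
    """A pluggable attack definition: identifier, targeted CWE class, and a
    generator that yields adversarial inputs for a given task seed."""
    name: str
    cwe: str
    generate: Callable[[str], List[str]]

_REGISTRY: Dict[str, AttackVector] = {}

def register(vector: AttackVector) -> None:
    """Add a new attack vector; the harness iterates the registry, so new
    threat models require no changes to the execution or metric pipeline."""
    _REGISTRY[vector.name] = vector

def adversarial_inputs(task_seed: str) -> Dict[str, List[str]]:
    """Generate adversarial inputs from every registered vector."""
    return {name: v.generate(task_seed) for name, v in _REGISTRY.items()}

# Example: registering a (toy) SQL-injection vector targeting CWE-89.
register(AttackVector(
    name="sqli_basic",
    cwe="CWE-89",
    generate=lambda seed: [f"{seed}' OR '1'='1", f"{seed}; DROP TABLE users;--"],
))
print(adversarial_inputs("user_lookup"))
```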
7. Limitations and Future Directions
Execution-grounded security benchmarks, while indispensable, face inherent challenges:
- Resource & Coverage Limitations: Exhaustive dynamic coverage is infeasible for many code and agent environments; benchmarks focus on high-risk patterns and bounded input sets.
- Non-determinism and reproducibility: Program behaviors may vary due to network effects, stochastic agent policies, or environmental side channels.
- LLM and heuristic-based evaluation noise: Automated semantic labeling introduces recall/precision trade-offs, as noted in DUALGAUGE (Pathak et al., 24 Nov 2025).
- Scalability constraints: Exponential trace analysis (e.g., quantum circuit reassembly) and complex multi-agent environments limit benchmarking scope (Bernardi et al., 3 Sep 2025).
- Evolution of attack techniques: Social engineering, multi-turn prompt attacks, and contextually adaptive adversaries demand continuous benchmark renewal and augmentation.
Leading frameworks recommend integrated and native compliance enforcement, persistent identity verification, semantic context monitors, and adaptive, layered reasoning architectures to mitigate advanced vulnerabilities (Yang et al., 9 Jan 2026, Zhang et al., 14 Oct 2025, Fu et al., 18 Jun 2025).
In summary, execution-grounded security benchmarks now underpin the comparative evaluation of code agents, tool-use platforms, financial LLMs, TEE environments, and protocol-driven systems. Through live instrumentation, dynamic adversarial scenarios, and rigorous, environment-sensitive metrics, these benchmarks expose real-world vulnerabilities and guide the next generation of security-hardening research and deployment.