HarnessAudit-Bench Framework
- HarnessAudit-Bench is a comprehensive auditing framework that defines, evaluates, and benchmarks LLM agent safety, compliance, and execution fidelity across 210 tasks in eight distinct domains.
- It employs both static and dynamic audits with deterministic and LLM-driven scoring, integrating YAML-defined policies to monitor tool use, resource access, and information flow.
- By comparing single-agent and multi-agent configurations, the framework uncovers mid-trajectory policy violations, underscoring the importance of robust, safety-critical agent deployments.
HarnessAudit-Bench is a benchmark and auditing framework for evaluating the safety, boundary compliance, and execution fidelity of LLM agents operating within complex execution harnesses, including multi-agent, multi-role collaborative settings with explicit resource and information-flow constraints. It addresses a gap in output-level evaluation, which cannot capture mid-trajectory or policy-violating actions in agent workflows. The benchmark instantiates 210 tasks across eight real-world domains, each embedded with fine-grained, declarative policy rules governing tool use, resource access, and communications. The protocol supports both static (definition-level) and dynamic (execution-level) audits, incorporates deterministic and LLM-driven scoring, and is extensible to arbitrary agent and harness configurations for systematic assessment of safety-critical agent deployments (Liu et al., 14 May 2026, Tu et al., 27 Apr 2026).
1. Domain and Task Design
HarnessAudit-Bench encompasses eight operationally diverse domains: Finance, E-commerce, Healthcare, Office Operations, Social Interaction, Daily Life, Legal Compliance, and Software Engineering. Each domain contains 2–4 typical scenarios (e.g., loan originations in Finance, teleconsultation in Healthcare), yielding 210 tasks distributed as follows: Finance (40), E-commerce (38), Healthcare (30), Office (27), Social (24), Daily (21), Legal (20), and Software (10). Roles and tools are designed to mirror production agent deployments: 69 distinct role templates (avg. 8.6 per domain), 94 unique tools (59 resource-bearing), and recurrent environment fixtures (e.g., SQLite-backed banks, code workspaces).
Tasks are specified via YAML artifacts encapsulating user goals, roles and tooling, hidden policy rules (access_rules), completion checkpoints, and ground-truth execution traces. Each task is executed in two harness conditions:
- Single-agent (“openclaw_local”): A monolithic agent (the “hub”) acts with access to all declared tools and resources, without inter-agent communication channels.
- Multi-agent (e.g., Claw-Team, Google ADK, OpenAI Agents SDK): The task is decomposed among 3–7 specialized agents (avg. 4.6), each bound to role-specific tool necessity tiers (required/forbidden/unnecessary) and communication channels. The hub delegates via explicit delegation tools, and the spokes execute and return sub-results.
This dual setup surfaces the impact of harness structure on policy adherence and risk propagation.
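A task specification of this kind can be pictured as a small data structure. The sketch below is illustrative only: the class and field names (`TaskSpec`, `Role`, `goal`, `access_rules`, `checkpoints`) paraphrase the artifacts listed above, the task values are invented, and `single_agent_view` is a hypothetical helper showing how the single-agent condition could expose every declared tool to one hub.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    name: str
    tools: list[str]


@dataclass
class TaskSpec:
    task_id: str
    goal: str
    roles: list[Role]
    access_rules: dict = field(default_factory=dict)  # hidden from agents at run time
    checkpoints: list[str] = field(default_factory=list)


def single_agent_view(spec: TaskSpec) -> Role:
    """Collapse all roles into one hub that holds every declared tool."""
    tools = sorted({tool for role in spec.roles for tool in role.tools})
    return Role(name="hub", tools=tools)


# Hypothetical task; identifiers are not taken from the benchmark.
spec = TaskSpec(
    task_id="finance-loan-001",
    goal="Process a loan application",
    roles=[
        Role("intake", ["read_application"]),
        Role("underwriter", ["read_application", "score_credit"]),
    ],
    checkpoints=["application_read", "credit_scored"],
)
hub = single_agent_view(spec)
```

In the multi-agent condition the same `spec.roles` would instead be handed to separate agents, each seeing only its own tool list.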
2. Embedded Safety Constraints and Policy Architecture
Every task instance is equipped with hidden “access_rules” that formalize:
- Tool Authorization: 11,586 role-tool entries (avg. 55.2 per task; 8.5 required, 27.9 forbidden, 18.7 unnecessary) assign binding permissions at the tool invocation level.
- Resource Scope: 3,094 fine-grained argument-level rules whitelist valid resource IDs or glob patterns.
- Information-Flow: 525 policies (105 tasks × 5 perturbations) label allowed communication directions and data classes forbidden to specific roles (e.g., SSNs in Finance, medical histories in Healthcare). By default, only hub↔spoke links are permitted unless specifically overridden.
Policies remain inaccessible to agents; post-run, the judge loads all constraints for verification.
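The three rule families can be checked deterministically once a run has finished. The following is a minimal sketch under stated assumptions: the `RULES` layout, the role and tool names, and the choice to treat unknown role-tool pairs as forbidden are all hypothetical, and the benchmark's real `access_rules` schema may differ. Glob matching uses the standard-library `fnmatch` module.

```python
from fnmatch import fnmatchcase

# Hypothetical rule layout, invented for illustration.
RULES = {
    "tools": {
        ("teller", "read_balance"): "required",
        ("teller", "wire_transfer"): "forbidden",
        ("teller", "send_newsletter"): "unnecessary",
    },
    "resources": {("teller", "read_balance"): ["acct_10*"]},
    # Default policy from the text: only hub<->spoke links are allowed.
    "flows": {("hub", "teller"), ("teller", "hub")},
}


def check_tool_call(role, tool, resource_id=None, rules=RULES):
    """Return the violation labels raised by a single tool call."""
    violations = []
    # Assumption: a role-tool pair with no entry is treated as forbidden.
    tier = rules["tools"].get((role, tool), "forbidden")
    if tier == "forbidden":
        violations.append("V-OT")
    patterns = rules["resources"].get((role, tool))
    if patterns is not None and resource_id is not None:
        if not any(fnmatchcase(resource_id, p) for p in patterns):
            violations.append("V-OR")
    return violations


def check_message(sender, receiver, rules=RULES):
    """Flag a message sent over a link that policy does not allow."""
    return [] if (sender, receiver) in rules["flows"] else ["V-IC"]
```

For example, `check_tool_call("teller", "read_balance", "acct_2001")` flags a resource-scope breach because the ID falls outside the whitelisted pattern.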
3. Auditing Dimensions, Scoring, and Formal Metrics
HarnessAudit-Bench audits the entire execution trajectory across three layers:
Layer 1: Boundary Compliance (BC)
- Channels: tool-based (t), resource-based (r), and information-flow (f). Violation types: V-OT (ordinary tools), V-OR (resource-bearing tools), V-IC (information channel), and V-ID (data disclosure).
- Safety Adherence Rate (SARc): computed per channel c ∈ {t, r, f} from severity-weighted violations, with severity weights ω_low=0.15, ω_high=0.30.
- Mean Safety Gate: the mean of the per-channel SAR values.
Layer 2: Execution Fidelity (EF)
- Action Validity Score (AVS): LLM-judged per-action rubric covering coverage, precision, resource-scope, and minimality, weighted 0.30, 0.30, 0.20, and 0.20 respectively.
- Task Completion Rate (TCR): Weighted sum over 1,647 completion checkpoints.
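As a rough illustration of the two fidelity metrics, the sketch below combines rubric scores with the stated AVS weights and computes a weighted checkpoint pass rate. Two points are assumptions, not confirmed by the source: that each rubric score lies in [0, 1], and that TCR is normalised by the total checkpoint weight. The function names are hypothetical.

```python
AVS_WEIGHTS = {
    "coverage": 0.30,
    "precision": 0.30,
    "resource_scope": 0.20,
    "minimality": 0.20,
}


def action_validity(scores):
    """Weighted sum of rubric scores, each assumed to lie in [0, 1]."""
    return sum(AVS_WEIGHTS[k] * scores[k] for k in AVS_WEIGHTS)


def task_completion(checkpoints):
    """checkpoints: list of (weight, passed) pairs.

    Assumption: the weighted sum is normalised by the total weight.
    """
    total = sum(weight for weight, _ in checkpoints)
    if total == 0:
        return 0.0
    return sum(weight for weight, passed in checkpoints if passed) / total


avs = action_validity(
    {"coverage": 1.0, "precision": 0.5, "resource_scope": 1.0, "minimality": 0.0}
)
tcr = task_completion([(2.0, True), (1.0, False), (1.0, True)])
```

In the benchmark the rubric scores come from an LLM judge; here they are supplied by hand to keep the example deterministic.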
Layer 3: System Stability (SS)
- Perturbation-based Robustness (PB): Stability under controlled attacks (indirect injection, ambiguous goals, runtime errors).
Aggregate Score
- Combines all axes using weights α=0.7, β=γ=0.15. The safety gate (mean SAR) strictly bounds the overall score.
Reported metrics per agent/harness include SARt, SARr, SARf, AVS, TCR, stability subscores, and trade-offs (S@T20/50/80, i.e., average safety adherence at varying completion thresholds). Resource-access compliance consistently underperforms tool-use adherence (SARr ≈ 0.33–0.70; SARt ≈ 0.60–0.85), and information-flow failures are amplified in multi-agent runs (SARf ≈ 0.58) compared to single-agent cases (Liu et al., 14 May 2026).
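One possible reading of the S@T trade-off metric is the mean safety adherence over runs whose completion rate reaches a given threshold. The sketch below implements that reading only; the exact definition used by the benchmark is not given here, and the function name and sample numbers are invented.

```python
def safety_at_threshold(runs, threshold):
    """Mean SAR over runs with TCR >= threshold.

    runs: list of (tcr, sar) pairs. Returns None if no run qualifies.
    This is an assumed interpretation of S@T, not the benchmark's formula.
    """
    kept = [sar for tcr, sar in runs if tcr >= threshold]
    return sum(kept) / len(kept) if kept else None


# Invented sample data: (task completion rate, safety adherence rate).
runs = [(0.9, 0.8), (0.6, 0.5), (0.3, 0.2), (0.1, 0.9)]
s_at = {t: safety_at_threshold(runs, t / 100) for t in (20, 50, 80)}
```

Reporting the value at several thresholds shows whether runs that complete more of the task are also the safer ones.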
4. Evaluation Protocol and Case Studies
Each task proceeds through:
- Setup: Load YAML spec, instantiate services, synthesize access rules and checkpoints.
- Execution: Run agent(s) in environment, logging all tool calls, resource accesses, and inter-agent communications into JSONL traces.
- Post-hoc Audit: Apply deterministic scoring (access-checker) and LLM-judged rubrics (for AVS, PB).
Violations are labeled and penalties assigned according to severity and rule type. No audit signals are leaked to agent code at runtime.
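The separation between execution-time logging and post-hoc auditing can be sketched with a JSONL trace. The event fields (`type`, `role`, `tool`) and the function names are assumptions made for this example; the benchmark's actual trace schema is not specified here. An in-memory buffer stands in for the trace file.

```python
import io
import json
from collections import Counter


def write_trace(events, fh):
    """Append one JSON object per line (JSONL) during execution."""
    for event in events:
        fh.write(json.dumps(event) + "\n")


def audit_trace(fh, forbidden):
    """Count violations after the run; `forbidden` is a set of (role, tool)."""
    counts = Counter()
    for line in fh:
        event = json.loads(line)
        if event["type"] == "tool_call" and (event["role"], event["tool"]) in forbidden:
            counts["V-OT"] += 1
    return counts


buf = io.StringIO()
write_trace(
    [
        {"type": "tool_call", "role": "clerk", "tool": "read_file"},
        {"type": "tool_call", "role": "clerk", "tool": "delete_db"},
        {"type": "message", "sender": "hub", "receiver": "clerk"},
    ],
    buf,
)
buf.seek(0)
report = audit_trace(buf, forbidden={("clerk", "delete_db")})
```

Because the audit reads only the finished trace, nothing about the rules is visible to the agent while it runs.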
Illustrative Violations
- Resource-scope breach (V-OR): Agent invokes a resource not on the whitelist.
- Tool-forbidden breach (V-OT): Role attempts a high-privilege tool denied by policy.
- Information disclosure (V-ID): Out-of-scope user data routed to unauthorized recipient.
- Execution fidelity fault: Workflow skips essential tool steps, as judged by rubric weights.
- Perturbation failure: Canary string (e.g., SQL injection) propagates through agent’s action, flagged during robustness checks.
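A perturbation failure of the last kind could be detected by searching logged actions for a planted marker. This is a simplified sketch: the canary value, the event layout, and the plain substring test are all assumptions, and the benchmark's robustness checks also rely on LLM-judged rubrics.

```python
CANARY = "CANARY-7f3a"  # hypothetical marker planted in injected content


def canary_propagated(trace, canary=CANARY):
    """Return True if any logged tool-call argument contains the canary."""
    for event in trace:
        for value in event.get("args", {}).values():
            if canary in str(value):
                return True
    return False


clean = [{"tool": "search", "args": {"q": "refund policy"}}]
tainted = clean + [{"tool": "send_email", "args": {"body": f"note {CANARY}"}}]
```

Here `canary_propagated(tainted)` is true because the marker reached an outgoing email body, while the clean trace passes.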
Emergent findings show that correct final outputs often correlate poorly with safe execution: mid-trajectory breaches go undetected by terminal-output benchmarks, and violations scale with trajectory length and inter-agent communication.
5. Architecture, Workflow, and Integration Protocols
HarnessAudit-Bench is constructed for modularity and extensibility, generalizing protocols established in BenchGuard (Tu et al., 27 Apr 2026). Its workflow comprises:
- Ingestion Layer: Parses a standardized directory structure into {INST, GT, EVAL, ENV} sets per task.
- Definition-Level Audits: Six-stage protocol verifies task specification, ground-truth alignment, evaluation-logic congruence, instruction quality, and environment integrity, deduplicating findings and calibrating confidence in claims.
- Execution-Level Audits: When agent traces are provided, the system cross-references execution with expected traces, elevating confidence for witnessed misbehaviors and supplying evidence for edge-case bugs invisible to static analysis.
- Consolidation and Reporting: All findings, categorized by source, severity, and confidence (Confirmed, Likely, Possible), are merged into JSON reports that can feed CI/CD pipelines and dashboard triage.
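The consolidation step can be illustrated with a small deduplication routine. In this sketch the finding fields, the `(task, issue)` deduplication key, and the rule of keeping the highest-confidence copy are assumptions chosen for clarity; only the three confidence labels come from the text.

```python
import json

CONF_RANK = {"Possible": 0, "Likely": 1, "Confirmed": 2}


def consolidate(findings):
    """Deduplicate by (task, issue), keeping the highest-confidence copy."""
    best = {}
    for finding in findings:
        key = (finding["task"], finding["issue"])
        current = best.get(key)
        if current is None or CONF_RANK[finding["confidence"]] > CONF_RANK[current["confidence"]]:
            best[key] = finding
    return sorted(best.values(), key=lambda f: (f["task"], f["issue"]))


# Invented findings: the same issue is seen statically and then in a trace.
findings = [
    {"task": "t1", "issue": "gt-mismatch", "source": "definition",
     "severity": "high", "confidence": "Likely"},
    {"task": "t1", "issue": "gt-mismatch", "source": "execution",
     "severity": "high", "confidence": "Confirmed"},
    {"task": "t2", "issue": "path-sanity", "source": "definition",
     "severity": "low", "confidence": "Possible"},
]
report = consolidate(findings)
report_json = json.dumps(report, indent=2)
```

The duplicate on task `t1` collapses into the execution-level finding, which mirrors how a witnessed misbehavior raises confidence over a static suspicion.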
LLM backend abstraction enables audits with any major model, supporting union and weighted-confidence ensemble strategies. Deterministic static checks (e.g., path-sanity, dependency resolution) complement LLM-driven protocols. Empirical studies using ScienceAgentBench and BIXBench-Verified-50 demonstrate exact recall ≥83.3% (A+P, up to 95.8%) for author-confirmed issues at low cost (<$15 for 50 tasks), validating HarnessAudit-Bench’s cross-artifact, execution-integrated approach (Tu et al., 27 Apr 2026).
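The two ensemble strategies named above can be contrasted in a few lines. This is a sketch under assumptions: the input layout, the normalised weighted sum, and the 0.5 acceptance threshold are hypothetical choices, not the framework's documented fusion rule.

```python
def union_ensemble(per_model):
    """Keep every issue that at least one model raised.

    per_model: {model_name: {issue_id: confidence in [0, 1]}}
    """
    issues = set()
    for found in per_model.values():
        issues |= set(found)
    return issues


def weighted_ensemble(per_model, weights, threshold=0.5):
    """Keep issues whose weight-normalised confidence sum reaches threshold."""
    total = sum(weights.values())
    fused = {}
    for model, found in per_model.items():
        for issue, conf in found.items():
            fused[issue] = fused.get(issue, 0.0) + weights[model] * conf / total
    return {issue for issue, score in fused.items() if score >= threshold}


# Invented model outputs for illustration.
per_model = {
    "model_a": {"bug1": 0.9, "bug2": 0.4},
    "model_b": {"bug1": 0.8},
}
weights = {"model_a": 1.0, "model_b": 1.0}
```

The union strategy favors recall by keeping both issues, while the weighted variant drops `bug2`, which only one model raised with low confidence.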
6. Generalization, Extensibility, and Best Practices
HarnessAudit-Bench’s API and plugin architecture support:
- Custom task and domain ingestion: Harbor-style layouts, converters.
- Verification engine extension: Parameterized definition/execution audit modules, severity thresholds, and taxonomy configuration.
- LLM ensemble configuration: Including confidence fusion policies.
- CI/CD integration: Automated pre-merge audits and triage dashboards.
- Automated test case generation (planned): Minimal reproducers for observed mismatches, with mechanisms for formal confirmation by comparing gold and evaluation script outputs.
Key methodological recommendations include running both static and dynamic audits, benchmarking across agent and harness variants, and calibrating findings through both deterministic and LLM-driven checks to surface subtle misalignment, specification flaws, and harness-induced risk surfaces.
7. Significance and Implications
HarnessAudit-Bench provides, for the first time, granular and interpretable safety profiling of agent harnesses across entire execution trajectories. It exposes risks in boundary compliance, unauthorized resource access, and inter-agent information flow that escape final-output-centric evaluation. Empirical findings show that correct output does not imply safe execution, that violations grow linearly with trajectory length, and that risk patterns are specific to domain and role.
This suggests that policy-conformant harness design, extensive scenario coverage, and automated, layered audit protocols are prerequisites for robust, safety-critical agent deployments. As benchmark and deployment complexity increase, systematic auditing of both agent behaviors and evaluation artifacts, integrating dynamic traces and definition-level protocols as realized in HarnessAudit-Bench, will become a foundational element of trustworthy LLM agent workflows (Liu et al., 14 May 2026, Tu et al., 27 Apr 2026).