AgentContract-Bench Benchmark Suite

Updated 2 July 2026

AgentContract-Bench is a multi-dimensional benchmark suite that defines and tests AI agents' ability to comply with formal contracts, covering behavioral, observation, and role-based specifications.
It organizes 200 multi-step scenarios across three tiers—agent domain, governance-stress, and composition—to rigorously assess contract integrity and multi-agent coordination under adversarial conditions.
Experimental evaluations with leading AI models report hard-constraint compliance of 88–100%, soft-violation recovery rates of 17–100%, and bounded behavioral drift, providing actionable insights into agent reliability and system robustness.

AgentContract-Bench is an open-source, multi-dimensional benchmark suite for evaluating AI agents’ compliance with formal contracts, spanning behavioral specifications, observation contract integrity, structured generation pipelines, and multi-agent role separation. It provides a substrate for systematically assessing agent reliability, compositionality, and coordination in settings where correct contract satisfaction is essential and error modes have quantifiable downstream impact. The benchmark is referenced across multiple research threads, including behavioral contract enforcement, agent pipeline synthesis, contract compliance with temporal and byte-level constraints, and enforced role separation in agent teams (Bhardwaj, 25 Feb 2026, Goel et al., 14 Feb 2026, Wang et al., 17 May 2026, Kim et al., 8 May 2026).

1. Benchmark Organization and Scenario Suite

AgentContract-Bench consists of 200 multi-step scenarios, each crafted to exercise diverse contract-theoretic constructs and adversarial contexts (Bhardwaj, 25 Feb 2026). The benchmark is categorized into three tiers:

Agent-Domain Tier (100 scenarios): Covers finance (PII, risk disclaimers), customer support (escalations, refund policies), code generation (secret handling, injection and license safety), research synthesis (citation and fabrication checks), and healthcare triage (scope, privacy compliance).
Governance-Stress Tier (50 scenarios): Executes domain contracts under prompt injection, tool failure, conflicting instructions, time/resource pressure, and social engineering stressors.
Composition Tier (50 scenarios): Tests contract composition (interface compatibility, assumption discharge, governance, and recovery independence) in multi-stage processes (e.g., loan pipelines).

Each scenario includes 5–8 agent steps. Tasks vary in difficulty (easy/medium/hard) determined by violation subtlety, constraint concurrency, and context depth. Sample tasks include “provide a retirement-portfolio rebalance without exposing client SSNs” and “draft a compliant loan decision.”

2. Formal Specification: Behavioral Contracts, Metrics, and Drift

The benchmark is grounded in the Agent Behavioral Contracts (ABC) formalism. Each contract C = (P, I, G, R) specifies (Bhardwaj, 25 Feb 2026):

P: Preconditions
I: Invariants, split into hard ($I_{hard}$) and soft ($I_{soft}$)
G: Governance policies, split into hard ($G_{hard}$) and soft ($G_{soft}$)
R: Recovery mechanisms

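Compliance is evaluated using (p, δ, k)-satisfaction, where p is the required probability level, δ the tolerated soft-compliance shortfall, and k the recovery window in steps. Agents must satisfy: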

Hard guarantee:

$P[\forall t \leq T: C_{hard}(t) = 1 ] \geq p$

Soft guarantee:

$P[ \forall t \leq T: C_{soft}(t) < 1-\delta \implies \exists t' \in [t, t+k]: C_{soft}(t') \geq 1-\delta ] \geq p$
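The two guarantees can be estimated empirically from logged sessions. The following is a minimal sketch, not the benchmark's own evaluator: the function names and the trace layout (one hard indicator and one soft score per step) are assumptions made for illustration, and a dip that occurs too close to the end of a trace to recover is counted as a failure.

```python
from typing import List, Sequence, Tuple

# Hypothetical trace layout: (hard indicators C_hard(t), soft scores C_soft(t)).
Session = Tuple[Sequence[int], Sequence[float]]


def hard_ok(hard_trace: Sequence[int]) -> bool:
    """True if C_hard(t) == 1 at every step of the session."""
    return all(c == 1 for c in hard_trace)


def soft_ok(soft_trace: Sequence[float], delta: float, k: int) -> bool:
    """True if every dip below 1 - delta recovers within k steps."""
    threshold = 1.0 - delta
    for t, score in enumerate(soft_trace):
        if score < threshold:
            window = soft_trace[t : t + k + 1]  # steps t .. t+k inclusive
            if not any(s >= threshold for s in window):
                return False
    return True


def satisfaction_rate(
    sessions: List[Session], delta: float, k: int
) -> Tuple[float, float]:
    """Fraction of sessions meeting the hard and the soft guarantee."""
    n = len(sessions)
    hard = sum(hard_ok(h) for h, _ in sessions) / n
    soft = sum(soft_ok(s, delta, k) for _, s in sessions) / n
    return hard, soft


# Toy data (invented): the second session breaks a hard constraint
# and never recovers from its soft dip.
sessions = [
    ([1, 1, 1], [1.0, 0.7, 0.95]),
    ([1, 0, 1], [1.0, 0.7, 0.6]),
]
hard_rate, soft_rate = satisfaction_rate(sessions, delta=0.1, k=1)
```

An agent would be judged (p, δ, k)-satisfying on this sample when both returned rates are at least p.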

Behavioral drift $D(t)$ is modeled via the Ornstein–Uhlenbeck process,

$dD(t) = (\alpha - \gamma D(t)) dt + \sigma dW(t)$

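yielding long-run drift bound $D^* = \alpha / \gamma$ (the stationary mean of the process) and Gaussian tail concentration, enabling contract designers to reason about reliability trade-offs and composition. Here $\alpha$ is the drift rate, $\gamma$ the mean-reversion rate, and $\sigma$ the noise scale.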
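For intuition, the standard closed-form moments of this Ornstein–Uhlenbeck process can be evaluated directly. The sketch below uses textbook results (mean $D^* + (D_0 - D^*)e^{-\gamma t}$, variance $\frac{\sigma^2}{2\gamma}(1 - e^{-2\gamma t})$, and a Chernoff bound on the stationary Gaussian tail); the parameter values are invented for illustration and are not taken from the benchmark.

```python
import math


def ou_mean(t: float, d0: float, alpha: float, gamma: float) -> float:
    """E[D(t)] for dD = (alpha - gamma * D) dt + sigma dW with D(0) = d0."""
    d_star = alpha / gamma  # long-run level D*
    return d_star + (d0 - d_star) * math.exp(-gamma * t)


def ou_variance(t: float, gamma: float, sigma: float) -> float:
    """Var[D(t)]; converges to sigma^2 / (2 * gamma) as t grows."""
    return sigma**2 / (2.0 * gamma) * (1.0 - math.exp(-2.0 * gamma * t))


def stationary_tail_bound(x: float, gamma: float, sigma: float) -> float:
    """Chernoff bound on P[D - D* >= x] under the stationary Gaussian law."""
    return math.exp(-gamma * x**2 / sigma**2)


# Illustrative parameters (assumptions): D* = 0.05 / 0.25 = 0.2.
alpha, gamma, sigma = 0.05, 0.25, 0.1
mean_after_12_turns = ou_mean(12.0, d0=0.0, alpha=alpha, gamma=gamma)
tail = stationary_tail_bound(0.3, gamma, sigma)  # chance of exceeding D* by 0.3
```

A larger $\gamma$ (stronger recovery) both lowers $D^*$ and tightens the tail, which is the trade-off a contract designer would tune.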

3. Experimental Methodology and Evaluation

Seven foundation and frontier model families are evaluated, including GPT-5.2 (OpenAI), Claude Opus 4.6 (Anthropic), Llama 3.3 70B (Meta), Mistral Large 3, DeepSeek-R1, and Grok 4 Fast (xAI), across 1,980 live-agent sessions (Bhardwaj, 25 Feb 2026). Four principal experiments are performed:

E1: 6-turn financial advisor tasks with and without contract enforcement.
E2: 12-turn sessions to measure drift bounds in extended scenarios.
E3: Stress tests under adversarial governance profiles.
E4: Ablation studies to analyze constraint recovery.

Key empirical results:

Contracted agents surface 5.2–6.8 soft violations per session that baselines miss.
Hard constraint compliance reaches 88–100%; drift stays bounded at $D_{max} < 0.27$ across extended sessions.
Recovery rate for soft violations ranges from 17% to 100% across all models, reaching 100% for frontier models.
Overhead for runtime contract checks is <10 ms per action (<1% of inference latency).
Reliability index $\Theta$ exceeds 0.95 in non-compositional domains; hardest cases are in composition ( $\Theta = 0.8865$ ).

4. Observation Contract Compliance: Temporal and Byte-Level Integrity

AgentContract-Bench inherits and extends concepts from ContractBench (Wang et al., 17 May 2026), defining observation contracts as artifacts $C=(o, t_{issue}, \tau, \pi)$ , with time windows and integrity predicates. Two axes—temporal validity and byte-level integrity—are independent, and compliance must be preserved jointly across sequences:

Validity failure: Use outside $[t_{issue}, t_{issue}+\tau)$ .
Integrity failure: $\pi(o')=0$ (e.g., mutated artifact).
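The two failure conditions above can be checked independently at the moment an artifact is used. This is a minimal sketch under stated assumptions: the `ObservationContract` class, the SHA-256 based predicate, and the example token are hypothetical stand-ins for $(o, t_{issue}, \tau, \pi)$ and are not ContractBench's actual API.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ObservationContract:
    observation: bytes                   # o: the artifact as issued
    t_issue: float                       # issue time, in seconds
    tau: float                           # validity window length
    integrity: Callable[[bytes], bool]   # pi: integrity predicate


def sha256_predicate(expected: bytes) -> Callable[[bytes], bool]:
    """Build a predicate that accepts only byte-identical artifacts."""
    digest = hashlib.sha256(expected).hexdigest()
    return lambda candidate: hashlib.sha256(candidate).hexdigest() == digest


def check_use(contract: ObservationContract, presented: bytes, now: float) -> List[str]:
    """Return the list of violated axes; an empty list means compliant use."""
    failures = []
    # Validity: use must fall inside the half-open window [t_issue, t_issue + tau).
    if not (contract.t_issue <= now < contract.t_issue + contract.tau):
        failures.append("validity")
    # Integrity: the presented bytes must satisfy pi.
    if not contract.integrity(presented):
        failures.append("integrity")
    return failures


# Invented example: a presigned-URL-like token valid for 60 seconds.
token = b"https://example.test/obj?sig=abc"
contract = ObservationContract(token, 100.0, 60.0, sha256_predicate(token))
on_time = check_use(contract, token, now=120.0)
expired_and_mutated = check_use(contract, token + b" ", now=160.0)
```

Keeping the two checks separate mirrors the independence of the axes: a fresh artifact can still be mutated, and an intact one can still be expired.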

The 33 dual-axis tasks probe failure quadrants: low/high validity, low/high integrity, and mixed-pressure scenarios matching real-world APIs (presigned URLs, OAuth, state-chain tokens). No model in evaluation exceeds 80% compliance (Opus 4.6 at 77.8%), with post-training regressions observed (e.g., GPT-5.1). In-context label coaching, using a 15-label failure taxonomy, yields measurable recovery (+7.1 pp on GPT-5.1) (Wang et al., 17 May 2026).

5. Structured Pipeline Evaluation for Smart Contracts

AgentContract-Bench applies an agentic, CrewAI-inspired sequential pipeline for translating unstructured requirements into Solidity with iterative security remediation and quality grading (Goel et al., 14 Feb 2026):

Phase 1: Extract UniversalContractSchema from NL using exact strings for variable, function, and state names.
Phase 2: Generate Solidity with enforced code-generation rules, explicit state-machine logic, and avoidance of forbidden constructs.
Phase 3: Invoke automated LLM security audit.
Phase 4: Apply refiner loop for up to 2 constraint-driven remediation cycles.
Phase 5–7: Generate ABI, serve on MCP, and produce a composite quality score.
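The audit and refiner phases (Phases 3 and 4) amount to a bounded remediation loop. The sketch below illustrates only that control flow: `toy_audit` and `toy_refine` are string-matching stand-ins for the LLM auditor and refiner, and the flagged constructs are assumptions chosen for the example rather than the pipeline's actual rule set.

```python
from typing import Callable, List, Tuple

MAX_REFINE_CYCLES = 2  # the pipeline allows up to 2 remediation cycles


def refine_until_clean(
    source: str,
    audit: Callable[[str], List[str]],
    refine: Callable[[str, List[str]], str],
    max_cycles: int = MAX_REFINE_CYCLES,
) -> Tuple[str, List[str], int]:
    """Audit, then refine and re-audit until clean or the cycle budget is spent."""
    findings = audit(source)
    cycles = 0
    while findings and cycles < max_cycles:
        source = refine(source, findings)
        cycles += 1
        findings = audit(source)
    return source, findings, cycles


# Toy stand-ins (hypothetical): flag substrings, fix one finding per cycle.
REPLACEMENTS = {"tx.origin": "msg.sender", "selfdestruct": "/* removed */"}


def toy_audit(source: str) -> List[str]:
    return [pattern for pattern in REPLACEMENTS if pattern in source]


def toy_refine(source: str, findings: List[str]) -> str:
    first = findings[0]
    return source.replace(first, REPLACEMENTS[first])


draft = "require(tx.origin == owner); selfdestruct(owner);"
final_source, residual, used = refine_until_clean(draft, toy_audit, toy_refine)
```

Returning the residual findings alongside the code lets a later grading phase penalize contracts that exhausted the budget without coming back clean.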

Quality is assessed on five axes: functional completeness, variable/parameter fidelity, state-machine correctness, business-logic fidelity, and code quality. Composite scores align with expert benchmarks: on the FSM-SCG dataset, reported results include a composite score in the B letter-grade range, the compilation rate, the reduction in critical vulnerability count, and detailed error-mode frequencies (e.g., logic omissions).

6. Multi-Agent Role Separation and Contract Enforcement

AgentContract-Bench is realized at scale in TeamBench (Kim et al., 8 May 2026), which evaluates agent teams under strict OS-level role isolation:

Roles: Planner (complete spec access), Executor (restricted brief and tool access), Verifier (spec+workspace, attest only).
Tasks: 931 seeded instances across 19 base and 21 refined categories. Reported metrics are role-violation rate, pass/fail rate, partial scores, and false-accept rate.
Key findings:
- Pass rates are similar under prompt-only and strict enforcement (~40% in both); however, OS sandboxing reduces cross-role violations and Verifier code-edit attempts (3.6× reduction).
- Team value is conditional: teams outperform only when solo agent capability is low.
- Verifier accepts 49% of failed submissions; removing Verifier can improve mean score.
- Human-agent hybrid and team studies reveal coordination bottlenecks not reflected in pass rates alone.
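The false-accept metric behind the Verifier finding can be computed as the share of ground-truth failures that the Verifier nevertheless attested. This is a minimal sketch; the `Run` record and its field names are assumptions for illustration, not TeamBench's log schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Run:
    passed: bool             # ground-truth outcome of the submission
    verifier_accepted: bool  # whether the Verifier attested it


def false_accept_rate(runs: List[Run]) -> float:
    """Fraction of failed submissions that the Verifier accepted."""
    failed = [r for r in runs if not r.passed]
    if not failed:
        return 0.0  # no failures, so nothing could be falsely accepted
    return sum(r.verifier_accepted for r in failed) / len(failed)


# Invented sample: two failures, one of which was accepted.
runs = [
    Run(passed=True, verifier_accepted=True),
    Run(passed=False, verifier_accepted=True),
    Run(passed=False, verifier_accepted=False),
    Run(passed=True, verifier_accepted=False),
]
rate = false_accept_rate(runs)
```

Conditioning on failed submissions, rather than on all runs, keeps the metric from being diluted by tasks the team solved anyway.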

7. Significance, Limitations, and Future Directions

AgentContract-Bench operationalizes formal contracting—behavioral, observation, structural, and multi-agent—in agentic systems, bridging specification to runtime enforcement. Its extensibility, deterministic evaluation, and broad scenario coverage set a baseline for rigorous, reproducible benchmarks.

Limitations include scenario representativeness (hand-crafted, non-random), deterministic constraint layouts, bounded recovery/iteration cycles, and domain specificity (contracts, coordination, governance). A plausible implication is that as the field progresses, further extensions will be necessary to cover open-world, adversarial, multi-chain, and economically motivated agent settings (Wang et al., 5 Mar 2026, Huang et al., 24 Jun 2026).

The benchmark suite is open source, with documented agents, orchestration scripts, schemas, and reproducibility guarantees, supporting further research in agent safety, reliability, and coordinated autonomy.