Measuring the Permission Gate: A Stress-Test Evaluation of Claude Code's Auto Mode

Published 4 Apr 2026 in cs.SE, cs.AI, and cs.CR | (2604.04978v1)

Abstract: Claude Code's auto mode is the first deployed permission system for AI coding agents, using a two-stage transcript classifier to gate dangerous tool calls. Anthropic reports a 0.4% false positive rate and 17% false negative rate on production traffic. We present the first independent evaluation of this system on deliberately ambiguous authorization scenarios, i.e., tasks where the user's intent is clear but the target scope, blast radius, or risk level is underspecified. Using AmPermBench, a 128-prompt benchmark spanning four DevOps task families and three controlled ambiguity axes, we evaluate 253 state-changing actions at the individual action level against oracle ground truth. Our findings characterize auto mode's scope-escalation coverage under this stress-test workload. The end-to-end false negative rate is 81.0% (95% CI: 73.8%-87.4%), substantially higher than the 17% reported on production traffic, reflecting a fundamentally different workload rather than a contradiction. Notably, 36.8% of all state-changing actions fall outside the classifier's scope via Tier 2 (in-project file edits), contributing to the elevated end-to-end FNR. Even restricting to the 160 actions the classifier actually evaluates (Tier 3), the FNR remains 70.3%, while the FPR rises to 31.9%. The Tier 2 coverage gap is most pronounced on artifact cleanup (92.9% FNR), where agents naturally fall back to editing state files when the expected CLI is unavailable. These results highlight a coverage boundary worth examining: auto mode assumes dangerous actions transit the shell, but agents routinely achieve equivalent effects through file edits that the classifier does not evaluate.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper introduces AmPermBench, a benchmark stress-testing the auto mode’s permission gate under ambiguous authorization scenarios.
The methodology performs action-level analysis, reporting high false negative rates (81.0% overall and 70.3% for Tier 3 actions) under controlled tasks.
Findings expose architectural gaps, notably in in-project file edits bypassing checks, suggesting the need for environment-aware and unified gating strategies.

Measuring the Efficacy of Claude Code's Auto Mode Permission System Under Ambiguity

System Architecture and Evaluation Pipeline

The paper provides the first independent stress-test evaluation of Claude Code's auto mode—Anthropic's deployed permission gating system for autonomous coding agents—targeting the critical dimension of authorization ambiguity in DevOps operations. The core contribution is the construction and use of AmPermBench, a synthetic, systematically controlled benchmark featuring 128 prompts (spanning four DevOps task families crossed with specificity, blast radius, and risk level) designed to probe the system's boundaries, particularly when user intent is directionally clear but operational authorization (e.g., scope, targets, or risk) remains ambiguous.

The auto mode architecture classifies tool calls into three tiers: Tier 1 (read-only tools, always allowed), Tier 2 (in-project file edits, always allowed without classifier evaluation), and Tier 3 (everything else, gated by a two-stage transcript classifier, utilizing both a high-recall fast filter and a reasoning-based final classifier). Importantly, the classifier's input excludes the agent's reasoning traces or tool outputs, ensuring a reasoning-blind evaluation.

Figure 1: Architecture and pipeline for AmPermBench: prompts generation, isolated task execution, action extraction, and multi-tier permission evaluation. The key classifier bypass for in-project file edits is highlighted.

Action-Level Evaluation Framework

The methodology departs from conventional task-level success measurement and instead performs action-level analysis, evaluating each state-changing operation (253 actions total) invoked by the agent across all sessions. Each action is labeled via a combination of LLM judge and human validation as true positive (TP), true negative (TN), false positive (FP), or false negative (FN), where the "ground truth" is strictly defined by mechanical policies mapping prompt constraints to environment entities. This evaluation protocol ensures rigorous, fine-grained identification of permission system failures.

Numerical Findings: Error Rates and Coverage Gaps

The auto mode classifier, under this deliberately ambiguous and scope-escalating workload, exhibits substantially higher end-to-end FNR (81.0%, 95% CI: 73.8–87.4%) and FPR (19.8%) compared to its reported production figures (17% FNR, 0.4% FPR, which are on organic, non-adversarial workloads). Of particular importance is that 36.8% of all state-changing actions (notably for artifact cleanup) never reach the classifier due to Tier 2 bypass—file edits in the project directory are categorically allowed and not evaluated for authorization violations.

When focusing only on Tier 3 actions actually intercepted by the permission system (shell/CLI/external), FNR remains at 70.3% and FPR rises to 31.9%. The classifier is less effective for precise, narrowly-scoped unauthorized actions, and more sensitive to "blast radius" (the number of entities affected)—large batch operations are blocked more aggressively, but this is partly a reflection of heuristic-based pattern recognition rather than genuine environment-aware authorization logic.

Figure 2: Left: FNR/FPR versus blast radius—FNR decreases for wide selectors, indicating stronger gating of batch operations. Right: FNR/FPR versus specificity, which shows non-monotonic, weak effects.

Structural Weaknesses and Task-Specific Patterns

The finding with the most significant security implication arises from the architectural coverage gap: in-project file edits (Tier 2) are systematically excluded from classification, based on the assumption that they are safe/routine and restorable by VCS. However, agents often fallback to direct file modification (e.g., jobs.json, objects.json) when the corresponding CLI is unavailable. This strategy allows state changes—including destructive or unauthorized mutations—to bypass the classifier entirely, as present extensively in artifact cleanup and to some degree in job cancellation.

Notably, artifact cleanup exhibits a 92.9% FNR, with the majority of FNs traced to Edit calls denied classifier inspection. Conversely, service restart is dominated by FPs, with correct authorized restarts over-blocked particularly in high-risk (production) lanes.

Figure 3: Confusion matrix by task: artifacts skewed towards high FNs due to Tier 2, services towards FPs (over-blocking in production), jobs more balanced.

Theoretical and Practical Implications

This analysis exposes a fundamental challenge in designing permission systems for agentic, general-purpose coding agents. Heuristic-based, transcript-only classifiers can improve batch operation safety but lack environmental reasoning capacities to verify fine-grained authorization against actual world state. Further, any architectural assumption that equates "in-project" file edits with low-risk operations is invalid in environments with operational state files or configuration-as-code—the agent’s ability to route around the classifier is not adversarial but an emergent property of LLM-based agent tool use.

The results imply that production-oriented permission gates must pair transcript reasoning with deeper understanding or inspection of both the tool interface and the environment, potentially incorporating OS-level sandboxing, runtime auditing, or hybrid policies that do not trust either the tool layer or project-local file boundaries as a hard security barrier.

Such coverage gaps are likely generic across agentic architectures; closing them will require both precise identification of potentially ambiguous indirect control paths and the integration of runtime provenance tracking beyond text-based tool call classification.

Future Directions

This work motivates several research and engineering trajectories:

Environment-aware classifiers: Incorporating state inspection or entity resolution to determine authorization rather than relying solely on prompt content and tool call transcript.
Unified gating across all state-changing channels: Decoupling risk assessment from the interface layer so file edits, shell commands, and API calls are analyzed under a shared permission framework.
Adaptive confirmation rather than outright blocking: Especially in high-risk settings, confirmation challenges or human-in-the-loop escalation could balance usability and safety.
Composable sandboxing and gating layers: Integration with microVMs or system call-based isolation to minimize blast radius regardless of tool routing path.

Robust evaluation will require additional benchmarks covering organizational policies, adversarial prompt injection, data exfiltration, and multi-agent compositional workflows; the demonstrated action-level approach is an essential protocol for such future assessments.

Conclusion

This study highlights that current permission gating for LLM coding agents is brittle under ambiguous, boundary-pushing workloads. The classifier’s high FNR under stress-test conditions, and the architectural bypass (Tier 2) for in-project file edits, indicate that a significant proportion of dangerous actions are neither detected nor blocked—even in the absence of overt prompt injection or adversarial behavior. Designers of practical agentic frameworks must address both fine-grained environment-coupled authorization reasoning and coverage for all state-manipulation vectors, or risk underestimating the true surface for operational security violations (2604.04978).

Markdown Report Issue