- The paper introduces AmPermBench, a benchmark stress-testing the auto mode’s permission gate under ambiguous authorization scenarios.
- The methodology performs action-level analysis, reporting high false negative rates (81.0% overall and 70.3% for Tier 3 actions) under controlled tasks.
- Findings expose architectural gaps, notably in in-project file edits bypassing checks, suggesting the need for environment-aware and unified gating strategies.
Measuring the Efficacy of Claude Code's Auto Mode Permission System Under Ambiguity
System Architecture and Evaluation Pipeline
The paper provides the first independent stress-test evaluation of Claude Code's auto mode—Anthropic's deployed permission gating system for autonomous coding agents—targeting the critical dimension of authorization ambiguity in DevOps operations. The core contribution is the construction and use of AmPermBench, a synthetic, systematically controlled benchmark featuring 128 prompts (spanning four DevOps task families crossed with specificity, blast radius, and risk level) designed to probe the system's boundaries, particularly when user intent is directionally clear but operational authorization (e.g., scope, targets, or risk) remains ambiguous.
The auto mode architecture classifies tool calls into three tiers: Tier 1 (read-only tools, always allowed), Tier 2 (in-project file edits, always allowed without classifier evaluation), and Tier 3 (everything else, gated by a two-stage transcript classifier, utilizing both a high-recall fast filter and a reasoning-based final classifier). Importantly, the classifier's input excludes the agent's reasoning traces or tool outputs, ensuring a reasoning-blind evaluation.
Figure 1: Architecture and pipeline for AmPermBench: prompts generation, isolated task execution, action extraction, and multi-tier permission evaluation. The key classifier bypass for in-project file edits is highlighted.
Action-Level Evaluation Framework
The methodology departs from conventional task-level success measurement and instead performs action-level analysis, evaluating each state-changing operation (253 actions total) invoked by the agent across all sessions. Each action is labeled via a combination of LLM judge and human validation as true positive (TP), true negative (TN), false positive (FP), or false negative (FN), where the "ground truth" is strictly defined by mechanical policies mapping prompt constraints to environment entities. This evaluation protocol ensures rigorous, fine-grained identification of permission system failures.
Numerical Findings: Error Rates and Coverage Gaps
The auto mode classifier, under this deliberately ambiguous and scope-escalating workload, exhibits substantially higher end-to-end FNR (81.0%, 95% CI: 73.8–87.4%) and FPR (19.8%) compared to its reported production figures (17% FNR, 0.4% FPR, which are on organic, non-adversarial workloads). Of particular importance is that 36.8% of all state-changing actions (notably for artifact cleanup) never reach the classifier due to Tier 2 bypass—file edits in the project directory are categorically allowed and not evaluated for authorization violations.
When focusing only on Tier 3 actions actually intercepted by the permission system (shell/CLI/external), FNR remains at 70.3% and FPR rises to 31.9%. The classifier is less effective for precise, narrowly-scoped unauthorized actions, and more sensitive to "blast radius" (the number of entities affected)—large batch operations are blocked more aggressively, but this is partly a reflection of heuristic-based pattern recognition rather than genuine environment-aware authorization logic.

Figure 2: Left: FNR/FPR versus blast radius—FNR decreases for wide selectors, indicating stronger gating of batch operations. Right: FNR/FPR versus specificity, which shows non-monotonic, weak effects.
Structural Weaknesses and Task-Specific Patterns
The finding with the most significant security implication arises from the architectural coverage gap: in-project file edits (Tier 2) are systematically excluded from classification, based on the assumption that they are safe/routine and restorable by VCS. However, agents often fallback to direct file modification (e.g., jobs.json, objects.json) when the corresponding CLI is unavailable. This strategy allows state changes—including destructive or unauthorized mutations—to bypass the classifier entirely, as present extensively in artifact cleanup and to some degree in job cancellation.
Notably, artifact cleanup exhibits a 92.9% FNR, with the majority of FNs traced to Edit calls denied classifier inspection. Conversely, service restart is dominated by FPs, with correct authorized restarts over-blocked particularly in high-risk (production) lanes.
Figure 3: Confusion matrix by task: artifacts skewed towards high FNs due to Tier 2, services towards FPs (over-blocking in production), jobs more balanced.
Theoretical and Practical Implications
This analysis exposes a fundamental challenge in designing permission systems for agentic, general-purpose coding agents. Heuristic-based, transcript-only classifiers can improve batch operation safety but lack environmental reasoning capacities to verify fine-grained authorization against actual world state. Further, any architectural assumption that equates "in-project" file edits with low-risk operations is invalid in environments with operational state files or configuration-as-code—the agent’s ability to route around the classifier is not adversarial but an emergent property of LLM-based agent tool use.
The results imply that production-oriented permission gates must pair transcript reasoning with deeper understanding or inspection of both the tool interface and the environment, potentially incorporating OS-level sandboxing, runtime auditing, or hybrid policies that do not trust either the tool layer or project-local file boundaries as a hard security barrier.
Such coverage gaps are likely generic across agentic architectures; closing them will require both precise identification of potentially ambiguous indirect control paths and the integration of runtime provenance tracking beyond text-based tool call classification.
Future Directions
This work motivates several research and engineering trajectories:
- Environment-aware classifiers: Incorporating state inspection or entity resolution to determine authorization rather than relying solely on prompt content and tool call transcript.
- Unified gating across all state-changing channels: Decoupling risk assessment from the interface layer so file edits, shell commands, and API calls are analyzed under a shared permission framework.
- Adaptive confirmation rather than outright blocking: Especially in high-risk settings, confirmation challenges or human-in-the-loop escalation could balance usability and safety.
- Composable sandboxing and gating layers: Integration with microVMs or system call-based isolation to minimize blast radius regardless of tool routing path.
Robust evaluation will require additional benchmarks covering organizational policies, adversarial prompt injection, data exfiltration, and multi-agent compositional workflows; the demonstrated action-level approach is an essential protocol for such future assessments.
Conclusion
This study highlights that current permission gating for LLM coding agents is brittle under ambiguous, boundary-pushing workloads. The classifier’s high FNR under stress-test conditions, and the architectural bypass (Tier 2) for in-project file edits, indicate that a significant proportion of dangerous actions are neither detected nor blocked—even in the absence of overt prompt injection or adversarial behavior. Designers of practical agentic frameworks must address both fine-grained environment-coupled authorization reasoning and coverage for all state-manipulation vectors, or risk underestimating the true surface for operational security violations (2604.04978).