Comparing Human Oversight Strategies for Computer-Use Agents

Published 6 Apr 2026 in cs.HC | (2604.04918v1)

Abstract: LLM-powered computer-use agents (CUAs) are shifting users from direct manipulation to supervisory coordination. Existing oversight mechanisms, however, have largely been studied as isolated interface features, making broader oversight strategies difficult to compare. We conceptualize CUA oversight as a structural coordination problem defined by delegation structure and engagement level, and use this lens to compare four oversight strategies in a mixed-methods study with 48 participants in a live web environment. Our results show that oversight strategy more reliably shaped users' exposure to problematic actions than their ability to correct them once visible. Plan-based strategies were associated with lower rates of agent problematic-action occurrence, but not equally strong gains in runtime intervention success once such actions became visible. On subjective measures, no single strategy was uniformly best, and the clearest context-sensitive differences appeared in trust. Qualitative findings further suggest that intervention depended not only on what controls users retained, but on whether risky moments became legible as requiring judgment during execution. These findings suggest that effective CUA oversight is not achieved by maximizing human involvement alone. Instead, it depends on how supervision is structured to surface decision-critical moments and support their recognition in time for meaningful intervention.


Authors (12)

Summary

The paper proposes a unified design space that organizes oversight strategies along delegation structure and engagement level.
Evaluation of four techniques reveals that plan-based approaches reduce problematic actions but show limited improvement in runtime correction.
Interface cues like risk labeling and contextual adaptation are critical for aligning human intervention with agent behavior.

Comparative Analysis of Human Oversight Strategies for LLM-Powered Computer-Use Agents

Conceptualizing the Oversight Design Space

The proliferation of LLM-powered Computer-Use Agents (CUAs) calls for systematic approaches to human oversight that mitigate the risks of privacy leakage, prompt injection, and susceptibility to interface dark patterns. This work proposes a unified design space for organizing and comparing CUA oversight strategies, parameterized along two dimensions: delegation structure (agent-led to human-controlled) and engagement level (step-level to plan-level). The framework lets existing and novel oversight mechanisms be described along the same two axes and guides empirical assessment of their efficacy.

Figure 1: The oversight strategy design space, parameterized by delegation structure and engagement level, positioning representative strategies and their hybridizations.

Four oversight strategies instantiate this space: Risk-Gated, Action Confirmation, Supervisory Co-Execution, and Structurally Enriched. Each places oversight responsibilities at different loci of agent-user interaction, thus producing distinct patterns of exposure to—and potential mitigation of—problematic agent actions.
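The design space can be pictured as a small data structure. The sketch below is illustrative only: the class and field names are hypothetical, and it records just what this summary states about each strategy (the plan-based grouping used in the results, and the engagement level where the text mentions one). Positions on the delegation axis are not given here, so they are left unset rather than guessed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Engagement(Enum):
    STEP_LEVEL = "step-level"
    PLAN_LEVEL = "plan-level"
    MIXED = "plan- and step-level"


@dataclass(frozen=True)
class OversightStrategy:
    name: str
    plan_based: bool                    # grouping used in the quantitative results
    engagement: Optional[Engagement]    # None where the summary does not say
    delegation: Optional[float] = None  # hypothetical scale: 0.0 agent-led, 1.0 human-controlled


# Only properties stated in the summary are filled in; everything else stays None.
STRATEGIES: List[OversightStrategy] = [
    OversightStrategy("Risk-Gated", plan_based=False, engagement=None),
    OversightStrategy("Action Confirmation", plan_based=False,
                      engagement=Engagement.STEP_LEVEL),
    OversightStrategy("Supervisory Co-Execution", plan_based=True,
                      engagement=Engagement.PLAN_LEVEL),
    OversightStrategy("Structurally Enriched", plan_based=True,
                      engagement=Engagement.MIXED),
]


def plan_based_names(strategies: List[OversightStrategy]) -> List[str]:
    """Return the names of strategies that include a plan-level component."""
    return [s.name for s in strategies if s.plan_based]
```

Leaving `delegation` as `None` makes the missing information explicit; the actual placement of each strategy is shown in Figure 1 of the paper.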

Interface Instantiations and Task Contexts

Each oversight strategy is concretized as a browser-based interface for supervising a Gemini 3.1-powered agent acting on real-world web tasks. The interfaces operationalize the delegation and engagement structure—for example, Supervisory Co-Execution exposes a plan-level review and hierarchical execution trace whereas Action Confirmation integrates per-action user approval dialogs.

Figure 2: Annotated interface screenshots illustrating oversight strategy implementations and user interaction points.
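The contrast between per-action approval and plan-level review comes down to where the human checkpoint sits in the execution loop. The following is a minimal sketch under stated assumptions: the function names, the callback signatures, and the example actions are all hypothetical, and the hierarchical execution trace is reduced to a flat list. It is not the interfaces' actual implementation.

```python
from typing import Callable, List


def run_with_action_confirmation(
    actions: List[str],
    approve_action: Callable[[str], bool],
) -> List[str]:
    """Checkpoint before every action: rejected actions are skipped."""
    executed = []
    for action in actions:
        if approve_action(action):
            executed.append(action)
    return executed


def run_with_plan_review(
    plan: List[str],
    review_plan: Callable[[List[str]], List[str]],
) -> List[str]:
    """Single checkpoint before execution: the user edits the plan once,
    then every remaining step runs and is recorded in a trace."""
    approved_plan = review_plan(plan)
    trace = []
    for step in approved_plan:
        trace.append(step)
    return trace


# Hypothetical actions, loosely modelled on the loan pre-qualification task.
DEMO_ACTIONS = [
    "open pre-qualification form",
    "enter annual income",
    "enter optional sensitive identifier",
    "submit form",
]


def is_safe(action: str) -> bool:
    # Stand-in for a user's judgment; real users may not recognize the risk.
    return "sensitive" not in action
```

In this toy setting both loops can filter out the same risky step, but they ask for the user's judgment at different times: once per action in the first case, once up front in the second.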

Evaluation tasks target high-consequence (finance, travel, public benefits) and low-consequence (food ordering, ticketing, review) domains, each embedding a unique attack vector (privacy leakage, prompt injection, or dark pattern).

Figure 3: Sample high-stakes loan pre-qualification task with a privacy leakage attack vector through unnecessary sensitive information request.

Figure 4: Typical prompt injection attack in the food ordering task, where injected interface content attempts to manipulate agent selection logic.

Empirical Methodology

A within-subjects experiment with 48 participants probes the interplay between oversight strategy, task context, subjective workload/control/trust/usability, and behavioral outcomes (problematic action exposure, intervention rates). Experimental control is maintained by uniform agent behavior (modulo approval constraints) and fixed adversarial stimuli across all strategy conditions.
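This summary does not report how the order of strategy conditions was assigned to participants. One common option for a within-subjects design with four conditions is a balanced Latin square, sketched below purely as an assumption; the function names are hypothetical, and the paper's actual procedure may differ.

```python
from collections import Counter
from typing import List

STRATEGY_NAMES = [
    "Risk-Gated",
    "Action Confirmation",
    "Supervisory Co-Execution",
    "Structurally Enriched",
]


def balanced_latin_square(n: int) -> List[List[int]]:
    """Build an n x n balanced Latin square (even n only).

    The first row follows the pattern 0, 1, n-1, 2, n-2, ...;
    each later row adds 1 (mod n) to the row above.
    """
    if n < 2 or n % 2 != 0:
        raise ValueError("this construction is balanced only for even n >= 2")
    first = [0]
    low, high = 1, n - 1
    while len(first) < n:
        first.append(low)
        low += 1
        if len(first) < n:
            first.append(high)
            high -= 1
    return [[(c + shift) % n for c in first] for shift in range(n)]


def assign_orders(num_participants: int, names: List[str]) -> List[List[str]]:
    """Cycle participants through the rows of the square."""
    square = balanced_latin_square(len(names))
    return [
        [names[i] for i in square[p % len(square)]]
        for p in range(num_participants)
    ]


orders = assign_orders(48, STRATEGY_NAMES)
order_counts = Counter(tuple(o) for o in orders)
```

With 48 participants and four conditions, each of the four orderings is used by 12 participants, and every strategy immediately precedes every other strategy in exactly one ordering.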

Quantitative Assessment: Behavioral and Subjective Findings

The comparative data reveal critical dissociations:

Exposure vs. Correction: Plan-based strategies (Supervisory Co-Execution, Structurally Enriched) substantially reduce the rate at which problematic actions reach execution (e.g., 60.4% and 74.5% attack occurrence, vs. ≥88% for non-plan-based). However, user intervention success rates, once an attack surfaces, are not significantly higher for these strategies. Final attack prevention remains incomplete, indicating a recognition and action bottleneck even under conditions of enhanced oversight.
Contextual Modulation of Subjective Measures: No strategy yields uniformly superior workload, trust, or usability across all task contexts. Structurally Enriched oversight delivers the highest trust and usability in higher-consequence settings, but this advantage is reversed or attenuated in lower-stakes domains. The context dependence is most pronounced for trust, suggesting that matching an oversight structure to users' subjective sense of empowerment is nontrivial.
Figure 5: Distribution of subjective workload, trust, control, and usability scores as a function of oversight strategy and task context.
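The exposure-versus-correction distinction is easier to see when the two quantities are computed separately. The sketch below uses a hypothetical `Trial` record and made-up demonstration data (not the study's data). Intervention success is measured only over trials in which an attack actually surfaced, which is why it cannot be read off the occurrence rate alone.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    strategy: str
    attack_occurred: bool                    # a problematic action surfaced
    intervened_successfully: Optional[bool]  # None when no attack occurred


def summarize(trials: List[Trial]) -> dict:
    """Compute exposure and correction as separate quantities."""
    if not trials:
        raise ValueError("need at least one trial")
    surfaced = [t for t in trials if t.attack_occurred]
    corrected = [t for t in surfaced if t.intervened_successfully]
    return {
        # Exposure: share of all trials in which the attack surfaced.
        "occurrence_rate": len(surfaced) / len(trials),
        # Correction: conditional on the attack having surfaced.
        "intervention_success": (
            len(corrected) / len(surfaced) if surfaced else None
        ),
        # Share of all trials in which the attack surfaced and was not stopped.
        "completed_attack_rate": (len(surfaced) - len(corrected)) / len(trials),
    }


# Made-up data for illustration only: 10 trials, 6 attacks surfaced, 3 stopped.
DEMO_TRIALS = (
    [Trial("hypothetical", True, True)] * 3
    + [Trial("hypothetical", True, False)] * 3
    + [Trial("hypothetical", False, None)] * 4
)

summary = summarize(DEMO_TRIALS)
```

For the demonstration data this gives an occurrence rate of 0.6, an intervention success of 0.5, and a completed-attack rate of 0.3. A strategy can lower the first number without changing the second, which is the pattern the findings describe for plan-based strategies.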

Qualitative Insights: Mechanisms of (In)Effective Oversight

Interview analysis reveals that effective human intervention is not achieved simply by adding approval checkpoints or increasing transparency.

Delegation Boundaries over Global Trust: Users calibrate their engagement dynamically, locally redefining oversight boundaries in response to perceived task criticality, routine vs. exceptional actions, and aggregation across interaction sessions.
Recognition as Bottleneck: Failures to intervene typically reflect interpretation rather than inattention: users often perceive agents' risky actions as routine or justifiable given the interface or domain context.
Interface Cues Drive Legibility: Risk labeling, salient warnings, and explicit "needs judgment" signals are necessary but insufficient. What dominates both attention allocation and the perceived need for intervention is how legible judgment-relevant moments are, as shaped by the interface.

Theoretical and Practical Implications

These findings call into question whether current regulatory or best-practice requirements centered on preserving "meaningful human control" are sufficient for CUA architectures. No observed oversight strategy achieved strong guarantees of both reduced risk exposure and robust correction capacity. Merely increasing human involvement (action gating, granular exposure of internal state) does not monotonically improve safety; instead, the timing, specificity, and interpretability of human involvement must be strategically structured. The results are consistent with prior automation literature on the "out-of-the-loop" effect, mode error, and automation bias, and empirically extend challenges noted in classic studies to the domain of generative, interface-level agents [10.1518/001872097778543886, 10.1109/3468.844354, BAINBRIDGE1983775].

The dissociation between reduced exposure (via plan-level oversight) and weak runtime correction shows that surface-level metrics such as attack occurrence can be misleading proxies for true oversight efficacy. Interfaces that let users move flexibly between plan-level and step-level engagement depending on context (Structurally Enriched) may align better with user needs in high-consequence settings, but the empirical evidence indicates this is far from a panacea.

Limitations and Future Research Trajectories

Three limitations require further exploration: consequence level is conflated with attack vector (privacy leakage vs. prompt injection/dark patterns); intervention efficacy is measured on a selected subset, since it is observable only when attacks surface; and results may be confounded by individual participants' attention or strategy preferences. More broadly, designs for CUA oversight must move beyond static checkpoint-based paradigms towards dynamic, contextualized delegation structures, robust risk legibility mechanisms, and perhaps adaptive engagement driven by real-time task and user modeling.

Conclusion

This work delineates the multidimensional design space of human oversight for LLM-powered computer-use agents. The primary determinant of oversight efficacy is not the amount of human control or transparency, but rather the structure by which consequential moments are surfaced and rendered actionable for human judgment. Dynamic oversight mechanisms, capable of adapting to context and consequence, emerge as a critical direction for future research on the reliable and safe integration of powerful CUA systems into high-stakes workflows.
