Magentic-UI: Towards Human-in-the-loop Agentic Systems (2507.22358v1)

Published 30 Jul 2025 in cs.AI and cs.HC

Abstract: AI agents powered by LLMs are increasingly capable of autonomously completing complex, multi-step tasks using external tools. Yet, they still fall short of human-level performance in most domains including computer use, software development, and research. Their growing autonomy and ability to interact with the outside world, also introduces safety and security risks including potentially misaligned actions and adversarial manipulation. We argue that human-in-the-loop agentic systems offer a promising path forward, combining human oversight and control with AI efficiency to unlock productivity from imperfect systems. We introduce Magentic-UI, an open-source web interface for developing and studying human-agent interaction. Built on a flexible multi-agent architecture, Magentic-UI supports web browsing, code execution, and file manipulation, and can be extended with diverse tools via Model Context Protocol (MCP). Moreover, Magentic-UI presents six interaction mechanisms for enabling effective, low-cost human involvement: co-planning, co-tasking, multi-tasking, action guards, and long-term memory. We evaluate Magentic-UI across four dimensions: autonomous task completion on agentic benchmarks, simulated user testing of its interaction capabilities, qualitative studies with real users, and targeted safety assessments. Our findings highlight Magentic-UI's potential to advance safe and efficient human-agent collaboration.

Summary

The paper presents its main contribution in developing Magentic-UI as a framework that blends human oversight with autonomous agents to enhance safety and task performance.
It details a multi-agent architecture supporting co-planning, dynamic co-tasking, and action approval, validated through extensive benchmarks and user studies.
Empirical evaluations demonstrate that minimal human feedback can significantly boost task completion rates while layered safety defenses mitigate risks like prompt injection.

Magentic-UI: A Human-in-the-Loop Framework for Agentic Systems

Introduction and Motivation

Magentic-UI addresses the persistent gap between the capabilities of autonomous LLM-based agents and the requirements for safe, reliable, and effective deployment in real-world, high-stakes environments. While LLM agents have demonstrated progress in automating complex, multi-step tasks involving web browsing, code execution, and file manipulation, they remain susceptible to misalignment, ambiguity, and adversarial manipulation. The paper posits that human-in-the-loop (HITL) agentic systems, which combine human oversight with agentic efficiency, are a necessary paradigm for advancing both productivity and safety in agentic AI.

Figure 1: Magentic-UI is an open-source research prototype of a human-centered agent that is meant to help researchers paper open questions on human-in-the-loop approaches and oversight mechanisms for AI agents.

System Overview and Interaction Mechanisms

Magentic-UI is an open-source, extensible web interface built atop a multi-agent architecture (adapted from Magentic-One) and designed to facilitate research into HITL agentic systems. The system supports a broad action space, including web browsing, code execution, and file operations, and is extensible via the Model Context Protocol (MCP) for tool integration. The user is modeled as a first-class agent within the multi-agent team, enabling seamless human-agent collaboration.

Figure 2: The Magentic-UI interface displaying a task in progress, with session management, agent updates, user input, and a live browser view.

Magentic-UI operationalizes six core interaction mechanisms:

Co-planning: Human and agent collaboratively construct and edit a plan before execution, resolving ambiguity and incorporating human priors.
Co-tasking: Dynamic handoff of control between agent and user during execution, supporting interventions, clarifications, and error recovery.
Action Approval (Action Guard): LLM- and heuristic-based gating of potentially irreversible or high-impact actions, requiring explicit user approval.
Answer Verification: Post-hoc validation of agent outputs via trace inspection and follow-up queries.
Memory: Persistent storage and retrieval of learned plans, enabling efficient repetition and adaptation of workflows.
Multi-tasking: Parallel execution and oversight of multiple agentic sessions.
Figure 3: The plan editor component in Magentic-UI, enabling direct user edits and plan acceptance prior to execution.

Figure 4: Co-tasking in Magentic-UI: (a) user interrupts agent, (b) agent interrupts user for clarification, (c) final answer presentation.

Figure 5: The saved plans view, supporting plan reuse, editing, and rerun for long-term memory.

System Architecture and Implementation

Magentic-UI's architecture comprises three principal components:

Multi-Agent Team: Orchestrator agent coordinates sub-agents (WebSurfer, Coder, FileSurfer, UserProxy, MCP agents), each with defined capabilities and sandboxed execution (Docker isolation).
User Interface: Real-time, session-based UI with embedded browser, plan editor, action trace, and notification system.
Backend: Session and state management, persistent storage (SQLite), and configuration management.
Figure 6: Overall System Architecture of Magentic-UI.

The orchestrator operates in two modes: planning (plan generation and negotiation) and execution (stepwise delegation, progress tracking, and dynamic replanning). The agent protocol enforces multimodal I/O and explicit capability descriptions, supporting modular extensibility.

Figure 7: Simplified Orchestrator loop, illustrating planning, execution, and dynamic replanning.

The action guard system combines static heuristics and an LLM-based judge to mediate agent actions, enforcing user oversight for irreversible or potentially harmful operations.

Figure 8: Magentic-UI implements an action guard system to ensure irreversible or potentially harmful agent actions are reviewed by the human user.

Empirical Evaluation

Autonomous Agent Performance

Magentic-UI was evaluated on four agentic benchmarks: GAIA, AssistantBench, WebVoyager, and WebGames. In autonomous mode (no human-in-the-loop), Magentic-UI matches the performance of Magentic-One on GAIA and AssistantBench, and achieves competitive results on WebVoyager and WebGames, with 82.2% and 45.5% task completion rates, respectively, using o4-mini and GPT-4o. However, it remains below human-level performance and the best specialized baselines.

Figure 9: Distribution of the run time in seconds of Magentic-UI on the WebVoyager dataset, stratified by task correctness.

Figure 10: Analysis of Magentic-UI planning statistics: re-planning frequency, plan length, and distribution across benchmarks.

Human-in-the-Loop and Simulated User Studies

Simulated user experiments on GAIA demonstrate that HITL can substantially improve agent performance. With a simulated user possessing side information, Magentic-UI's task completion rate increases by 71% (from 30.3% to 51.9%), with minimal user intervention (help requested in only 10% of tasks). Even a simulated user with a more capable LLM (but no side information) yields a 42.6% completion rate.

Figure 11: Comparison on the GAIA validation set: HITL Magentic-UI bridges the gap to human performance at a fraction of the cost.

A qualitative user paper (N=12) found that users value co-planning for aligning agent actions with subjective preferences, co-tasking for error recovery and control, and action guards for critical decisions. The System Usability Scale (SUS) score was 74.58, with 75% of participants finding the system easy to use, though only 41.7% expressed interest in frequent use.

Figure 12: System Usability Scale results: high usability, but mixed interest in frequent use.

Safety and Security

Magentic-UI underwent targeted adversarial testing, including direct requests for sensitive data, social engineering, and prompt injection attacks. With default mitigations (action guard, sandboxing, isolated browser), no adversarial scenario resulted in a successful exploit. Disabling mitigations revealed that prompt injection remains a reliable attack vector, underscoring the necessity of layered defenses.

Figure 13: In scenario social_eng_03, Magentic-UI encounters a phishing popup, identifies the threat, and waits for user approval.

Figure 14: In scenario social_eng_01, Magentic-UI pauses execution and requests user approval when encountering egregious OAuth requests.

Figure 15: In scenario injection_web_01, Magentic-UI detects a prompt injection targeting SSH keys and pauses for user approval.

Discussion and Implications

Magentic-UI demonstrates that explicit HITL mechanisms can improve agent reliability, safety, and user trust, even when underlying agentic capabilities are imperfect. The co-planning interface enables up-front alignment, but the natural language plan DSL limits expressivity (e.g., no branching or parallelism). Action guards and progress indicators support oversight, but tuning the frequency and granularity of interruptions remains an open problem. The memory system enables workflow reuse, but does not yet address user preference or context adaptation.

The empirical results highlight that even lightweight human feedback can yield substantial gains in agentic task completion, and that most of the benefit can be realized with minimal user intervention. However, latency, verbosity, and model errors remain significant pain points, and the system's performance is still below that of expert humans, especially on tasks requiring advanced coding, multimodal reasoning, or long action sequences.

From a safety perspective, the necessity of strong isolation, explicit approval, and defense-in-depth is empirically validated. Prompt injection and social engineering remain active threats, and future work must address more sophisticated attack vectors and adaptive adversaries.

Future Directions

Key avenues for future research include:

Enriching the plan representation to support hierarchical, branched, or parallel workflows.
Learning personalized interruption and approval policies based on user preferences and trust calibration.
Integrating richer forms of agent memory, including user preferences, context, and long-term adaptation.
Scaling HITL paradigms to support collaborative teams of humans and agents, and to reduce the cognitive burden of oversight in multi-tasking scenarios.
Advancing automated detection and mitigation of prompt injection and other emergent attack vectors.

Conclusion

Magentic-UI provides a comprehensive, extensible platform for studying and advancing human-in-the-loop agentic systems. Its architecture and interaction mechanisms operationalize key principles for safe, effective, and user-aligned agent deployment. Empirical results substantiate the value of HITL, both for improving task success and for mitigating safety risks. The open-source release of Magentic-UI offers a foundation for further research into scalable, reliable, and trustworthy agentic AI.

PDF Markdown

Follow-up Questions

Related Papers

Authors (20)

First 10 authors:

Tweets

https://twitter.com/pyautogen/status/1950935027702714702

https://twitter.com/HsseinMzannar/status/1950405596265762860