Magentic-UI: Towards Human-in-the-loop Agentic Systems (2507.22358v1)

Published 30 Jul 2025 in cs.AI and cs.HC

Abstract: AI agents powered by LLMs are increasingly capable of autonomously completing complex, multi-step tasks using external tools. Yet, they still fall short of human-level performance in most domains including computer use, software development, and research. Their growing autonomy and ability to interact with the outside world, also introduces safety and security risks including potentially misaligned actions and adversarial manipulation. We argue that human-in-the-loop agentic systems offer a promising path forward, combining human oversight and control with AI efficiency to unlock productivity from imperfect systems. We introduce Magentic-UI, an open-source web interface for developing and studying human-agent interaction. Built on a flexible multi-agent architecture, Magentic-UI supports web browsing, code execution, and file manipulation, and can be extended with diverse tools via Model Context Protocol (MCP). Moreover, Magentic-UI presents six interaction mechanisms for enabling effective, low-cost human involvement: co-planning, co-tasking, multi-tasking, action guards, and long-term memory. We evaluate Magentic-UI across four dimensions: autonomous task completion on agentic benchmarks, simulated user testing of its interaction capabilities, qualitative studies with real users, and targeted safety assessments. Our findings highlight Magentic-UI's potential to advance safe and efficient human-agent collaboration.

Summary

The paper introduces Magentic-UI, an open-source HITL framework that enhances agent safety and reliability through integrated human oversight.
It utilizes co-planning, co-tasking, and action guards in a modular multi-agent architecture to manage digital tasks like web navigation and code execution.
Quantitative evaluations show that lightweight human feedback boosts task accuracy by 71% in simulated user studies, highlighting practical performance gains.

Magentic-UI: A Human-in-the-Loop Framework for Agentic Systems

Introduction and Motivation

Magentic-UI addresses the persistent gap between the capabilities of autonomous LLM-based agents and the requirements for safe, reliable, and effective deployment in real-world, open-ended tasks. While LLM agents have demonstrated progress in tool use, web navigation, and code execution, they remain susceptible to misalignment, ambiguity, and adversarial manipulation. The paper posits that human-in-the-loop (HITL) agentic systems, which combine human oversight with agentic efficiency, are a promising approach to mitigate these limitations and unlock productivity from imperfect systems.

Figure 1: Magentic-UI is an open-source research prototype of a human-centered agent that is meant to help researchers paper open questions on human-in-the-loop approaches and oversight mechanisms for AI agents.

Magentic-UI is introduced as an open-source, extensible research platform for studying and developing HITL agentic systems. It is designed to support a broad range of digital tasks, including web browsing, code execution, and file manipulation, and is architected to facilitate systematic experimentation with human-agent interaction mechanisms.

System Design and Interaction Mechanisms

Magentic-UI is built on a modular multi-agent architecture, adapted from Magentic-One, and leverages the AutoGen framework for orchestrating agent collaboration. The system treats the human user as a first-class agent, enabling seamless integration of human input at multiple stages of task execution.

Figure 2: Overall System Architecture of Magentic-UI.

The core interaction mechanisms are:

Co-Planning: Human and agent collaboratively construct and refine a plan before execution, resolving ambiguities and incorporating human priors.
Co-Tasking: Dynamic handoff of control between agent and user during execution, allowing for human intervention, agent queries, and verification.
Action Approval (Action Guard): Irreversible or potentially harmful actions are intercepted and require explicit human approval, using a combination of heuristics and LLM-based judgment.
Answer Verification: Users can inspect agent actions, review execution traces, and pose follow-up queries to validate outcomes.
Memory: Plans can be saved, edited, and reused, supporting long-term adaptation and efficient handling of repetitive tasks.
Multi-Tasking: Users can manage and monitor multiple concurrent sessions, each with independent agent teams.
Figure 3: The Magentic-UI interface displaying a task in progress, with session management and real-time agent/browser interaction.

Figure 4: The plan editor enables direct user modification of the agent's plan prior to execution.

Figure 5: Co-tasking in Magentic-UI: (a) user interrupts agent, (b) agent queries user, (c) user verifies final answer.

Figure 6: Saved plans view, supporting plan reuse and long-term memory.

Multi-Agent Orchestration and Agent Protocols

The orchestrator agent governs the workflow, alternating between planning and execution modes. Plans are represented as sequences of natural language steps, each assigned to a specific agent. The orchestrator dynamically assigns steps, tracks progress, and triggers replanning as needed.

Figure 7: Simplified Orchestrator loop governing planning, execution, and dynamic replanning.

Agents in Magentic-UI include:

WebSurfer: Browser automation with a rich action space, sandboxed via Docker, and governed by allow-lists.
Coder: Code generation and execution in isolated containers, with error handling and iterative refinement.
FileSurfer: File operations and conversions, leveraging MarkItDown tools.
MCP Agents: Integration of external Model Context Protocol servers for extensibility.
UserProxy: The human user, with a protocol for when and how to defer to human input.

Action Guard: Safety and Oversight

To address the risk of irreversible or unsafe actions, Magentic-UI implements a two-stage action guard. Actions are first filtered by developer-specified heuristics, then, if ambiguous, passed to an LLM-based judge for approval determination.

Figure 8: Action guard system ensures human review of potentially harmful or irreversible agent actions.

Evaluation: Autonomous, Simulated, and Human-in-the-Loop

Autonomous Agent Performance

Magentic-UI was evaluated on GAIA, AssistantBench, WebVoyager, and WebGames benchmarks. In autonomous mode, it matches the performance of Magentic-One on GAIA and AssistantBench, and achieves competitive results on WebVoyager and WebGames, with 82.2% and 45.5% task completion rates, respectively, using o4-mini and GPT-4o. However, it remains below human-level performance and the best specialized agents on certain tasks.

Figure 9: Distribution of Magentic-UI run times on WebVoyager, showing longer runtimes for unsuccessful tasks.

Figure 10: Planning statistics: plan length and replanning rates across benchmarks.

Human-in-the-Loop and Simulated User Studies

Simulated user experiments on GAIA demonstrate that lightweight human feedback can substantially improve agent performance. With a simulated user possessing side information, Magentic-UI's accuracy increases by 71% (from 30.3% to 51.9%), with minimal user intervention (help requested in only 10% of tasks).

Figure 11: Human-in-the-loop improves Magentic-UI accuracy, bridging the gap to human performance at lower cost.

Qualitative User Study

A qualitative paper with 12 participants assessed usability and the effectiveness of HITL features. The System Usability Scale score was 74.58, with 75% of users finding the system easy to use. Participants valued co-planning for aligning on subjective preferences, co-tasking for error correction and control, and action guards for critical decisions. Latency and verbosity were noted as pain points, and users expressed a desire for more visual summaries and flexible plan editing.

Figure 12: System Usability Scale results indicate positive user experience, with some reservations about frequent use.

Safety and Security Analysis

Magentic-UI underwent targeted adversarial testing, including direct requests for sensitive actions, social engineering, and prompt injection attacks. With default mitigations enabled (action guard, sandboxing, isolated browser), no adversarial scenario was successful. Disabling mitigations revealed that prompt injection remains a significant risk, underscoring the necessity of layered defenses.

Figure 13: Magentic-UI identifies a phishing popup, re-plans, and pauses for user approval.

Figure 14: Encountering egregious OAuth requests, Magentic-UI pauses and requests user approval before proceeding.

Figure 15: On prompt injection attempts to leak SSH keys, Magentic-UI re-plans and halts for user review.

Discussion, Limitations, and Implications

Magentic-UI operationalizes a taxonomy of HITL mechanisms for agentic systems, providing concrete solutions for plan alignment, oversight, and verification. The shared plan representation, action histories, and status indicators facilitate transparency and control, but challenges remain in scaling plan editing, summarizing long action traces, and tuning interruption frequency to user preferences.

The system's performance is still below human-level on complex tasks, particularly those requiring advanced coding, multimodal understanding, or extended action sequences. The evaluation is limited to English and does not measure long-term productivity gains. Safety relies on strict sandboxing and user vigilance; prompt injection remains a latent risk if mitigations are bypassed.

Practically, Magentic-UI provides a robust platform for researchers and developers to experiment with HITL paradigms, test new agent behaviors, and paper the trade-offs between autonomy and oversight. Theoretically, it advances the understanding of how to structure human-agent collaboration for reliability and safety.

Conclusion

Magentic-UI demonstrates that human-in-the-loop interaction, implemented as a core architectural principle, can improve the reliability, safety, and effectiveness of agentic systems. Its open-source release provides a foundation for further research into scalable, trustworthy, and user-aligned AI agents. Future work should address the scalability of HITL mechanisms, richer plan representations, adaptive interruption strategies, and more comprehensive safety guarantees.

PDF Markdown

Follow-up Questions

Related Papers

Authors (20)

First 10 authors:

Tweets

https://twitter.com/pyautogen/status/1950935027702714702

https://twitter.com/HsseinMzannar/status/1950405596265762860