ResearStudio: Human-Intervenable Deep-Research Framework

Updated 4 July 2026
  • ResearStudio is an open-source framework that enables controllable deep-research agents with real-time human intervention.
  • It features a hierarchical Planner–Executor architecture with a live plan-as-document interface for dynamic task management.
  • The framework achieves state-of-the-art GAIA benchmark results while ensuring sandboxed safety and flexible role switching.

ResearStudio is an open-source framework for building controllable deep-research agents that supports real-time human intervention during execution rather than a purely “fire-and-forget” operating mode. It is organized around a shared, persistent workspace, a hierarchical Planner–Executor agent core, and a web interface that exposes plans, files, tool calls, and intermediate artifacts as live, editable objects. The framework is presented as a “Collaborative Workshop,” emphasizing transparency, symmetrical control, and dynamic role fluidity: users can pause a run, edit the plan or code, run custom commands, and resume, switching smoothly between AI-led, human-assisted and human-led, AI-assisted modes. In fully autonomous mode, the system is reported to achieve state-of-the-art results on GAIA among the compared systems, while retaining its central design commitment to fine-grained human control (Yang et al., 14 Oct 2025).

ResearStudio addresses a limitation attributed to prior deep-research agents: once launched, they typically offer little opportunity for meaningful intervention before completion. The framework is motivated by cases in which an agent misinterprets the task, follows an unproductive search path, writes flawed code, or incorporates misleading information. In such settings, passive observation is operationally costly because errors propagate through a rigid pipeline rather than being corrected in situ (Yang et al., 14 Oct 2025).

The paper frames its alternative as a new interaction paradigm, the Collaborative Workshop, characterized by three properties. Transparency means that plans, intermediate artifacts, and actions are visible. Symmetrical Control means that both the human and the agent can modify the shared workspace. Dynamic Role Fluidity means that control can shift smoothly between AI-led and human-led workflows. This design moves intervention from an external supervisory role into the core execution model itself (Yang et al., 14 Oct 2025).

A central claim is that the system combines capabilities that are usually separated across tool categories. The paper contrasts it with industrial deep-research systems, which are described as capable but not controllable, and with canvas-like editors or document systems, which are described as controllable but not sufficiently agentic. A plausible implication is that ResearStudio is intended not merely as a better agent, but as a different operational substrate for research work: an editable, inspectable, sandboxed project space in which autonomy and intervention coexist (Yang et al., 14 Oct 2025).

2. Three-layer architecture

The framework is organized into a three-layer architecture consisting of L-1: MCP Toolbox, L-2: Agent Core, and L-3: WebPage / interactive UI (Yang et al., 14 Oct 2025).

At L-1, tools are exposed through the Model Context Protocol (MCP), implemented with fastmcp. MCP standardizes tool invocations as reliable JSON-based function calls. The tool layer includes document-processing tools, search tools, and code or shell execution tools. Browser automation is optional and is deliberately de-emphasized by default (Yang et al., 14 Oct 2025).
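The idea of tool invocations as JSON-based function calls can be illustrated with a minimal, standard-library sketch. It does not use fastmcp or the real MCP wire format; the `TOOLS` registry, the `tool` decorator, the `dispatch` function, and the `{"tool": ..., "arguments": ...}` request shape are all hypothetical names chosen for illustration.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical in-process registry. A real deployment would expose tools
# through an MCP server instead of a module-level dict.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function as a callable tool under its own name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def word_count(text: str) -> int:
    """Toy document-processing tool: count whitespace-separated words."""
    return len(text.split())


def dispatch(request_json: str) -> str:
    """Decode a JSON request, call the named tool, return a JSON reply."""
    request = json.loads(request_json)
    name = request["tool"]
    if name not in TOOLS:
        return json.dumps({"ok": False, "error": f"unknown tool: {name}"})
    try:
        result = TOOLS[name](**request.get("arguments", {}))
    except Exception as exc:  # report tool failures instead of crashing
        return json.dumps({"ok": False, "error": str(exc)})
    return json.dumps({"ok": True, "result": result})


reply = dispatch('{"tool": "word_count", "arguments": {"text": "plan as document"}}')
print(reply)
```

Because every request and reply is plain JSON, the same call can be logged, replayed, or shown in a UI without knowing anything about the tool's implementation.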

At L-2, the agent core is hierarchical. The Planner decomposes the task and writes or updates a live plan, while the Executor carries out concrete steps using tools. The important architectural point is that planning is externalized: the plan is written into a visible and editable TODO.md file rather than remaining an inaccessible latent state. This makes the plan itself part of the runtime interface between human and agent (Yang et al., 14 Oct 2025).

At L-3, the web interface presents the conversation or activity stream, workspace files, the current plan, file diffs and changes, and controls for pausing, resuming, and editing. The paper emphasizes that this is not a chat-only frontend. It is an integrated workspace for human-agent partnership in which the execution trace, file system state, and intervention controls are unified (Yang et al., 14 Oct 2025).

3. Plan-as-document and live communication

One of the framework’s defining mechanisms is the plan-as-document design. The Planner continuously writes and revises its plan in TODO.md, which is simultaneously visible to the user, editable by the user, and used as the active control surface for execution. The workflow described in the paper is: the Planner creates or revises a plan, writes it into TODO.md, the Executor performs the next step, and the user may inspect or edit the plan at any point. If the plan is changed, execution can be redirected without restarting the run from scratch (Yang et al., 14 Oct 2025).
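A minimal sketch of the plan-as-document loop follows, assuming the plan is stored as a Markdown checklist. The checkbox format and the helper names (`write_plan`, `next_step`, `mark_done`, `run`) are assumptions for illustration, not the framework's actual file layout or API. The key point is that the loop re-reads TODO.md before every step, so an edit made mid-run takes effect without a restart.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Optional

OPEN, DONE = "- [ ] ", "- [x] "


def write_plan(path: Path, steps: List[str]) -> None:
    """Planner side: write the plan as a Markdown checklist."""
    path.write_text("".join(f"{OPEN}{s}\n" for s in steps), encoding="utf-8")


def next_step(path: Path) -> Optional[str]:
    """Executor side: re-read the file and return the first open item."""
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.startswith(OPEN):
            return line[len(OPEN):]
    return None


def mark_done(path: Path, step: str) -> None:
    """Tick off the first open occurrence of a step."""
    text = path.read_text(encoding="utf-8")
    path.write_text(text.replace(OPEN + step, DONE + step, 1), encoding="utf-8")


def run(path: Path, execute: Callable[[str], None]) -> List[str]:
    """Execute open items one by one, re-reading the plan each time."""
    finished = []
    while (step := next_step(path)) is not None:
        execute(step)
        mark_done(path, step)
        finished.append(step)
    return finished


plan = Path(tempfile.mkdtemp()) / "TODO.md"
write_plan(plan, ["search sources", "browse every result", "write summary"])


def execute(step: str) -> None:
    # Simulate a human editing the plan while the first step is running.
    if step == "search sources":
        text = plan.read_text(encoding="utf-8")
        plan.write_text(
            text.replace("browse every result", "read top three results"),
            encoding="utf-8",
        )


executed = run(plan, execute)
print(executed)
```

The simulated edit replaces the second step while the first is still running, and the executor picks up the revised step on its next read.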

This mechanism is coupled to a communication layer described as both a central communication protocol and a dual-layered communication system. At the machine level, MCP mediates structured tool requests. At the human-agent level, the system uses an event-driven protocol with a long-lived connection between frontend and backend. User actions are sent as API calls with a unique task ID, and the backend streams updates to the UI in real time. Large files are lazy-loaded on click to keep the interface responsive (Yang et al., 14 Oct 2025).

The paper highlights several concrete workflows enabled by this design. Under Submit Task, the agent autonomously executes through the tool layer. Under Change Files, a POST request sends new file content, updates the workspace, and notifies the executor. Under Pause/Resume, a request stalls all backend LLM calls, freezing the agent’s cognitive state until resumed. The paper’s emphasis is that the interface is not merely observational; it is a live control plane for the entire agent trajectory (Yang et al., 14 Oct 2025).
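The Pause/Resume behavior, in which a request stalls backend LLM calls until the user resumes, can be sketched with a gate that every model call must pass through. This is one possible mechanism under stated assumptions, not the framework's implementation: `PauseGate`, `guarded_call`, and the stand-in `fake_llm_call` are hypothetical names.

```python
import threading
from typing import Optional


class PauseGate:
    """A flag that model calls wait on; cleared while the run is paused."""

    def __init__(self) -> None:
        self._running = threading.Event()
        self._running.set()  # start in the running state

    def pause(self) -> None:
        self._running.clear()

    def resume(self) -> None:
        self._running.set()

    def wait_until_running(self, timeout: Optional[float] = None) -> bool:
        """Block until resumed; return False if the timeout expires first."""
        return self._running.wait(timeout)


def fake_llm_call(prompt: str) -> str:
    """Stand-in for a real model call (assumption for illustration)."""
    return f"response to: {prompt}"


def guarded_call(gate: PauseGate, prompt: str,
                 timeout: Optional[float] = None) -> Optional[str]:
    """Issue the model call only when the gate is open."""
    if not gate.wait_until_running(timeout):
        return None  # still paused: nothing was sent to the model
    return fake_llm_call(prompt)


gate = PauseGate()
gate.pause()
blocked = guarded_call(gate, "step 1", timeout=0.05)  # paused, times out
gate.resume()
resumed = guarded_call(gate, "step 1")  # same prompt proceeds after resume
print(blocked, resumed)
```

Because the prompt and surrounding state are untouched while the gate is closed, the agent continues from exactly where it stopped once the gate reopens.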

4. Workspace, tooling, and intervention modalities

ResearStudio’s workspace is both persistent and sandboxed. Each task runs in a fully sandboxed workspace isolated from the host and from other tasks, and tool operations are confined to this directory. The paper presents this as a controllability feature and as part of its safety posture, because mediated tool execution replaces direct system access (Yang et al., 14 Oct 2025).
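One small piece of confining tool operations to a workspace directory is rejecting paths that escape it. The sketch below shows only that path check, assuming a per-task root directory; `resolve_in_workspace` and `SandboxError` are hypothetical names, and a path check alone does not provide isolation from the host, which would need OS-level or container-level mechanisms that this article does not detail.

```python
import tempfile
from pathlib import Path


class SandboxError(Exception):
    """Raised when a requested path falls outside the task workspace."""


def resolve_in_workspace(workspace: Path, user_path: str) -> Path:
    """Resolve user_path against the workspace and refuse any escape."""
    root = workspace.resolve()
    # Joining with an absolute user_path discards root, so absolute paths
    # outside the workspace are caught by the same containment check.
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise SandboxError(f"path escapes workspace: {user_path}")
    return candidate


workspace = Path(tempfile.mkdtemp())
inside = resolve_in_workspace(workspace, "notes/summary.md")
print(inside)

for bad in ("../outside.txt", "/etc/passwd", "notes/../../x"):
    try:
        resolve_in_workspace(workspace, bad)
    except SandboxError as err:
        print("rejected:", err)
```

Resolving before comparing matters: it normalizes `..` segments and symlinks, so a path cannot pass the check by string prefix alone.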

The document-processing subsystem supports a broad family of file types. For each file f, the system invokes a modality-specific extractor D(f), yielding text, captions, or structured objects. The examples given include VLM captions for images, ASR transcripts for audio, slide-wise markdown for .pptx, and row-wise CSV parsing for spreadsheets. These extracted artifacts are displayed immediately in the UI, where they can be edited, annotated, or discarded before the agent proceeds (Yang et al., 14 Oct 2025).
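The modality-specific extractor D(f) can be pictured as a dispatch on file type. The sketch below covers only row-wise CSV parsing and plain text, using the standard library; the `EXTRACTORS` table, the `{"kind": ..., "content": ...}` result shape, and the function names are assumptions for illustration. Image, audio, and slide extractors would call external models and are omitted.

```python
import csv
import io
from pathlib import PurePath
from typing import Any, Callable, Dict


def extract_csv(data: bytes) -> Dict[str, Any]:
    """Row-wise CSV parsing into a list of string lists."""
    rows = list(csv.reader(io.StringIO(data.decode("utf-8"))))
    return {"kind": "rows", "content": rows}


def extract_text(data: bytes) -> Dict[str, Any]:
    """Plain text passes through after decoding."""
    return {"kind": "text", "content": data.decode("utf-8")}


# Hypothetical extension-to-extractor table.
EXTRACTORS: Dict[str, Callable[[bytes], Dict[str, Any]]] = {
    ".csv": extract_csv,
    ".txt": extract_text,
    ".md": extract_text,
}


def D(filename: str, data: bytes) -> Dict[str, Any]:
    """Pick an extractor by file extension (case-insensitive)."""
    suffix = PurePath(filename).suffix.lower()
    extractor = EXTRACTORS.get(suffix)
    if extractor is None:
        return {"kind": "unsupported", "content": None}
    return extractor(data)


table = D("results.CSV", b"level,score\n1,77.36\n")
print(table)
print(D("clip.mp3", b""))
```

Returning a uniform result shape is what would let a UI render, edit, or discard any extracted artifact the same way regardless of source modality.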

The search subsystem combines a self-hosted searxng metasearch, Crawl4AI page fetches, and reranking by contextual similarity. Users can accept a result, reject a result, or request deeper crawling. Browser automation is excluded by default because, according to the authors, LLM planners tend to overuse browser control, which adds latency and yields little structured information; it can nonetheless be re-enabled with a toggle if needed (Yang et al., 14 Oct 2025).
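Reranking by similarity can be sketched as scoring each fetched snippet against the query and sorting. The similarity model used by the framework is not specified here, so this example swaps in a simple bag-of-words cosine similarity; `tokens`, `cosine`, and `rerank` are hypothetical helper names, and a production system would more likely use learned embeddings.

```python
import math
import re
from collections import Counter
from typing import List


def tokens(text: str) -> Counter:
    """Lowercased alphanumeric token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rerank(query: str, snippets: List[str]) -> List[str]:
    """Order snippets from most to least similar to the query."""
    q = tokens(query)
    return sorted(snippets, key=lambda s: cosine(q, tokens(s)), reverse=True)


snippets = [
    "Recipe for sourdough bread",
    "GAIA benchmark results use exact match scoring",
    "A benchmark of GPU kernels",
]
ranked = rerank("GAIA benchmark exact match", snippets)
for s in ranked:
    print(s)
```

A user-facing accept or reject action would then operate on this ordered list, for example by removing a rejected snippet before the agent reads it.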

The code environment is sandboxed and persistent. It supports creating files, running shell or Python commands, inspecting outputs across iterations, rollback to previous snapshots, and branch creation for “what-if” exploration. Every script is shown before execution, and the user may modify code, inject assertions, comment out commands, or disable snippets. A safe-exec guard whitelists common packages such as numpy, pandas, and torch. Standard output, error streams, and rich artifacts such as tables and figures stream back live and are logged as immutable cells (Yang et al., 14 Oct 2025).
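A package whitelist of the kind described can be approximated with a static check of a script's import statements before it runs. This is a sketch under assumptions, not the framework's safe-exec guard: `ALLOWED` contains only the three packages named above, and `disallowed_imports` is a hypothetical helper. A static check like this does not catch dynamic imports such as `__import__` or `importlib`, so it would complement sandboxing rather than replace it.

```python
import ast
from typing import Set

# Only the packages named in the text; a real whitelist would be longer.
ALLOWED: Set[str] = {"numpy", "pandas", "torch"}


def disallowed_imports(source: str) -> Set[str]:
    """Return top-level package names imported outside the whitelist."""
    found: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found - ALLOWED


safe_script = "import numpy as np\nfrom pandas import DataFrame\n"
risky_script = "import os, numpy\nfrom subprocess import run\n"

print(disallowed_imports(safe_script))
print(sorted(disallowed_imports(risky_script)))
```

Since every script is shown before execution, a check like this could flag the offending lines in the UI so the user can comment them out or approve them explicitly.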

The intervention interface supports pausing the run, editing TODO.md, editing code or data files, running custom terminal commands, resuming execution, and downloading or exporting the entire workspace. The paper describes two collaboration modes, AI-led with human assistance and human-led with AI assistance, and notes smooth transitions between them. The central architectural claim is that role switching does not require restarting the session (Yang et al., 14 Oct 2025).

5. Operational workflow and model configuration

The paper describes the core execution cycle as a closed-loop Planner–Executor workflow mediated by shared documents and streaming events. A task begins when the user submits a request. The Planner creates a live plan in TODO.md. The Executor selects tools and performs actions. Tool calls and outputs are streamed to the UI. The user may accept or reject results, edit files, pause the system, or otherwise intervene. Once resumed, the backend continues from the revised workspace state (Yang et al., 14 Oct 2025).

Model selection is role-specific. The Planner uses gpt-4.1. The Executor uses o4-mini for datasets excluding GAIA. The image-processing component uses gpt-4o. The video agent uses gemini-2.5-pro. The audio agent uses Assembly AI. The GAIA benchmark executor uses o3. The paper presents this allocation as a balance between efficiency and task-specific capability (Yang et al., 14 Oct 2025).

The paper also reports typical operational parameters: Average Task Runtime (GAIA) is about 20 minutes, Max Concurrent Workers is 50, Max Interaction Rounds is 30, Average Steps per Task is about 25, and Typical Final Workspace Size is about 100 MB. These values are presented as evidence that the framework is practical for long-horizon tasks and scalable deployment (Yang et al., 14 Oct 2025).

The qualitative appendix is used to illustrate how the workflow behaves in success and failure cases. In one successful GAIA Level-3 computational puzzle involving two 12-digit numbers, transpositions, and a weighted-sum checksum, the agent completed the task in about four minutes over 16 discrete steps by deciding to write a Python program rather than reasoning purely in text. In a failed Mariana Trench / Freon-12 case, the agent chose the wrong temperature by confusing ambient trench temperature with hydrothermal vent temperature. The paper uses the latter case to argue that real-time intervention is valuable because a human could pause the run, edit the plan, and resume (Yang et al., 14 Oct 2025).

6. Evaluation on GAIA and empirical positioning

ResearStudio is evaluated on GAIA, following Memento. All reported experiments were run in fully autonomous mode with no human intervention, specifically to measure the architecture’s raw capability rather than collaborative gains. The metric is Exact Match (EM), defined by the paper as follows: “a prediction is considered correct only if it exactly matches the reference answer after normalizing for case, punctuation, and articles. The EM score is defined as the percentage of answers that achieve a perfect match” (Yang et al., 14 Oct 2025).
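The quoted Exact Match definition can be made concrete with a short sketch. The specific normalization choices here (lowercasing, stripping ASCII punctuation, dropping the English articles "a", "an", and "the", collapsing whitespace) are assumptions consistent with the quoted description; the scorer actually used in the evaluation may differ in detail. Note that stripping punctuation also removes decimal points, so numeric answers may need special handling.

```python
import re
import string
from typing import List

ARTICLES = re.compile(r"\b(a|an|the)\b")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(answer: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = answer.lower().translate(PUNCT_TABLE)
    text = ARTICLES.sub(" ", text)
    return " ".join(text.split())


def exact_match(predictions: List[str], references: List[str]) -> float:
    """Percentage of predictions equal to their reference once normalized."""
    if not references or len(predictions) != len(references):
        raise ValueError("need equal-length, non-empty answer lists")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


preds = ["The Eiffel Tower.", "42", "paris"]
refs = ["eiffel tower", "43", "Paris"]
score = exact_match(preds, refs)
print(f"EM = {score:.2f}%")
```

In this toy example two of the three predictions match after normalization, so the score is two thirds expressed as a percentage.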

The reported GAIA results are summarized below.

| Split      | Level 1 | Level 2 | Level 3 | Average |
|------------|---------|---------|---------|---------|
| Validation | 77.36   | 69.77   | 61.54   | 70.91   |
| Test       | 84.95   | 72.33   | 59.18   | 74.09   |

On GAIA validation, the paper reports ResearStudio (Pass@1) scores of 77.36 on Level-1, 69.77 on Level-2, 61.54 on Level-3, and 70.91 on average. The baselines listed in the excerpt include ODR-smolagents at 55.15 average, AutoAgent at 55.15, OWL at 69.09, A-World at 69.70, and OpenAI-DeepResearch at 67.36. The paper highlights that ResearStudio is particularly strong on harder tasks, with the best reported numbers on Level-2 and Level-3 (Yang et al., 14 Oct 2025).

On the GAIA test set, the paper reports 84.95 for Level-1, 72.33 for Level-2, 59.18 for Level-3, and 74.09 average. It is reported to outperform the listed open-source baselines, including OWL at 60.80 average and A-World at 63.12. The abstract additionally claims that ResearStudio surpasses systems like OpenAI’s DeepResearch and Manus in fully autonomous mode; the explicit comparisons shown in the excerpt are against OpenAI-DeepResearch, OWL, A-World, and other open-source baselines (Yang et al., 14 Oct 2025).
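The reported averages can be sanity-checked as question-weighted means of the per-level scores. The per-level question counts below (53/86/26 for validation and 93/159/49 for test) are an assumption based on the commonly cited GAIA split sizes and are not stated in this article; `weighted_average` is a hypothetical helper.

```python
from typing import Sequence

# Assumed GAIA question counts per level (not given in the article).
VALIDATION_COUNTS = (53, 86, 26)
TEST_COUNTS = (93, 159, 49)

# Per-level EM scores as reported for ResearStudio.
VALIDATION_SCORES = (77.36, 69.77, 61.54)
TEST_SCORES = (84.95, 72.33, 59.18)


def weighted_average(counts: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of per-level scores weighted by the number of questions."""
    return sum(c * s for c, s in zip(counts, scores)) / sum(counts)


val_avg = round(weighted_average(VALIDATION_COUNTS, VALIDATION_SCORES), 2)
test_avg = round(weighted_average(TEST_COUNTS, TEST_SCORES), 2)
print(val_avg, test_avg)
```

Under these assumed counts, the weighted means agree with the reported 70.91 and 74.09, which suggests the averages are computed over all questions rather than as an unweighted mean of the three levels.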

A recurrent theme in the paper is that these autonomous results are not meant to displace the collaborative premise. Rather, they are used to argue that strong automated performance and fine-grained human control can coexist. This suggests that the framework’s contribution lies in preserving intervention without sacrificing competitive autonomous behavior (Yang et al., 14 Oct 2025).

7. Safety, limitations, and relation to adjacent systems

The paper makes several safety-related claims, but also states important boundaries on those claims. Safety is primarily architectural: tasks run in fully sandboxed workspaces; tool execution is mediated through MCP rather than direct system calls; TODO.md functions as an inspectable checkpoint at which flawed or unsafe strategies can be caught; and input classifiers are used to filter disallowed content. The architecture is also said to help users spot anomalous changes in the plan or activity log and pause the run. However, the authors explicitly note that these measures have not yet been rigorously stress-tested against active adversarial attacks (Yang et al., 14 Oct 2025).

Three limitations are emphasized. First, collaborative use depends on human expertise, because spotting subtle errors can be cognitively demanding. Second, the paper does not yet formally quantify the gains from human intervention, since the evaluation focuses on autonomous performance. Third, safety is not adversarially hardened; it is architectural rather than the product of rigorous red-teaming. The proposed future work includes AI-powered alerts for semi-autonomous intervention, formal HCI studies measuring task time, error correction rate, and user satisfaction, and red-teaming against prompt injection, data exfiltration, and harmful generation (Yang et al., 14 Oct 2025).

Within the broader landscape of research-support systems, ResearStudio occupies a distinct position. RTVis is a visualization toolkit for topic discovery that combines a field theme river, a co-occurrence network, a specialized citation bar chart, and a word frequency race diagram to help researchers analyze paper metadata and understand research trends. FS-Researcher is a file-system-based dual-agent framework for long-horizon deep research that externalizes state into a persistent workspace and separates context building from report writing to achieve test-time scaling beyond the context window. ResearStudio differs in emphasis: its central contribution is not research-trend visualization or context-window scaling per se, but a human-intervenable execution environment in which the plan, files, code, and tool outputs are continuously inspectable and editable during the run (Yang et al., 14 Oct 2025).

The paper’s broader significance is therefore methodological. It proposes that deep-research systems need not choose between autonomy and control. Instead, the agent loop can be restructured around a shared workspace, a live plan document, and an event-driven interface that supports intervention as an ordinary part of execution rather than as an after-the-fact correction mechanism.
