
Simple Deepresearch: LLM Research Agent Framework

Updated 13 July 2025
  • Simple Deepresearch is a prompt-based, end-to-end framework that transforms large language models into modular research assistants.
  • It employs an iterative think–act–observe–update cycle with structured actions like planning, searching, scripting, summarizing, and reporting.
  • The framework integrates with the Deep Research Comparator for real-time benchmarking and detailed human feedback on both process and outcome.

Simple Deepresearch refers to a prompt-based, end-to-end agent framework that transforms an LLM into a baseline deep research agent by orchestrating iterative web search, evidence synthesis, and report generation within a rigorously structured reasoning and action loop. Developed as part of the Deep Research Comparator platform, Simple Deepresearch serves both as a reference implementation for evaluating LLM-based research agents and as a modular scaffold for rapid prototyping, benchmarking, and development of new academic research assistants (Chandrahasan et al., 7 Jul 2025).

1. Workflow and Agent Architecture

Simple Deepresearch follows an iterative “think–act–observe–update” process, in which the agent leverages the user query and an evolving internal history to decide its next action at each step. Each cycle, indexed by step $k$, takes in the user query $q$ and the agent's current history $h_k$ (the record of all prior model thoughts, actions, and observations) and produces:

  • a thought process $t_k$ (the model's internal deliberation, either explicit or implicit), and
  • an action $a_k$ (selected from a predefined, structured action space).

The regulated action space includes:

  • <plan>: Formulate a high-level approach or research plan,
  • <search>: Issue a web search via an API call,
  • <scripts>: Create or revise the draft of the report,
  • <summary>: Compress previous history (for models with limited context),
  • <answer>: Output the complete research report as the final agent act.

A table in the source (Table 1, Chandrahasan et al., 7 Jul 2025) enumerates these actions and their formats, ensuring structured, trackable agent behavior. This design strengthens agent interpretability and facilitates downstream analysis of each research step.
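The tagged format also makes the agent's output straightforward to parse programmatically. The following is a minimal sketch in Python, assuming the tag names listed above and a simple regular-expression extraction; the exact tag grammar and prompt used in the source may differ.

```python
import re

# Action tags from the regulated action space described above.
ACTION_TAGS = ("plan", "search", "scripts", "summary", "answer")

def parse_action(model_output: str):
    """Return (action_type, content) for the first tagged action found, or (None, None)."""
    pattern = r"<(" + "|".join(ACTION_TAGS) + r")>(.*?)</\1>"
    match = re.search(pattern, model_output, re.DOTALL)
    if match is None:
        return None, None
    return match.group(1), match.group(2).strip()

# Example:
# parse_action("I should look this up. <search>LLM research agents</search>")
# -> ("search", "LLM research agents")
```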

2. Iterative History Update and Formalism

The agent's state evolves according to a precise recursive update, facilitating transparent memory management and efficient prompting for LLMs.

The update rule for the agent’s history after step $k$ is:

$$
h_{k+1} =
\begin{cases}
h_k + t_k + a_k + \mathrm{obs}_k, & \text{if } a_k \text{ is a Search action} \\
h_k + t_k + a_k, & \text{if } a_k \text{ is a Plan or Script action} \\
a_k, & \text{if } a_k \text{ is a Summary action}
\end{cases}
$$

Here, $\mathrm{obs}_k$ represents any external observations (such as API search results) returned for a Search action. The design ensures that only the essential context is retained, enabling systematic state “compression” and supporting models with limited context length.
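A minimal sketch of this loop in Python, reusing the `parse_action` helper from Section 1 and assuming caller-supplied `call_llm` and `web_search` functions (neither is specified in the source); the three branches mirror the cases of the update rule above.

```python
def run_agent(query: str, call_llm, web_search, max_steps: int = 20) -> str:
    """Think-act-observe-update loop with the recursive history update h_{k+1}."""
    h = ""                                              # h_1: empty initial history
    for _ in range(max_steps):
        output = call_llm(query=query, history=h)       # thought t_k plus tagged action a_k
        action, content = parse_action(output)
        if action == "answer":
            return content                              # final report terminates the loop
        if action == "search":
            obs = web_search(content)                   # external observation obs_k
            h = h + output + f"\n<observation>{obs}</observation>\n"
        elif action in ("plan", "scripts"):
            h = h + output + "\n"                       # no external observation to append
        elif action == "summary":
            h = content + "\n"                          # the summary replaces the entire history
    return h                                            # fallback if no <answer> was emitted
```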

3. Integration Capabilities and LLM Interoperability

The prompt-based scaffold and standardized action tagging allow the Simple Deepresearch framework to be readily integrated with any LLM supporting an API interface (e.g., OpenAI’s completions endpoint). Developers can adapt diverse LLMs to the same reasoning scaffold by specifying prompt templates and consistent intermediate output formats (e.g., tagged actions with explicit content boundaries).
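As an illustration of this interoperability, the `call_llm` callable used in the sketches above could be bound to any chat-completions style endpoint. The system prompt and model name below are placeholders, not the templates used in the source.

```python
from openai import OpenAI  # any provider exposing an OpenAI-compatible API can be substituted

SYSTEM_PROMPT = (
    "You are a deep research agent. At each step, think, then emit exactly one action "
    "wrapped in one of the tags <plan>, <search>, <scripts>, <summary>, or <answer>."
)

def make_call_llm(model: str = "gpt-4o-mini"):
    """Return a call_llm(query, history) callable bound to one backend, so models can be swapped."""
    client = OpenAI()

    def call_llm(query: str, history: str) -> str:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"Query: {query}\n\nHistory so far:\n{history}"},
            ],
        )
        return response.choices[0].message.content

    return call_llm
```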

This architecture markedly lowers the engineering barrier for comparative evaluation: models can be swapped, combined, or ablated directly within the same iterative framework, supporting robust side-by-side experimental designs (Chandrahasan et al., 7 Jul 2025).

4. System-Level Evaluation Framework: Deep Research Comparator

Simple Deepresearch has been deployed within the Deep Research Comparator platform, a system supporting:

  • Hosting and orchestration of multiple deep research agents,
  • Real-time, side-by-side comparison of final reports and intermediate steps,
  • Fine-grained human feedback via step and span annotation,
  • Aggregation of both outcome-based (final report quality) and process-based (reasoning trace) metrics.

The modular platform comprises:

  • A static web frontend listing agent outputs and reasoning chains in parallel,
  • Main backend services handling user queries, agent streaming, and result collation,
  • Isolated agent containers serving decisions via a unified JSON protocol,
  • A ranking calculation using models such as Bradley–Terry to aggregate pairwise votes and upvote/downvote statistics.
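The source does not spell out the exact aggregation procedure; the sketch below shows a standard Bradley–Terry fit over a pairwise win-count matrix (the minorize–maximize updates of Hunter's algorithm), with hypothetical counts for three agents.

```python
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """Fit Bradley-Terry strengths from wins[i, j] = times agent i beat agent j."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):                        # standard MM updates
        total_wins = wins.sum(axis=1)
        denom = np.zeros(n)
        for i in range(n):
            for j in range(n):
                if i != j:
                    n_ij = wins[i, j] + wins[j, i]
                    if n_ij > 0:
                        denom[i] += n_ij / (p[i] + p[j])
        p = total_wins / np.maximum(denom, 1e-12)
        p = p / p.sum()                           # normalize strengths to sum to 1
    return p

# Hypothetical pairwise win counts for three agents.
wins = np.array([[0, 12, 15],
                 [8,  0, 10],
                 [5,  9,  0]], dtype=float)
print(bradley_terry(wins))                        # higher value = stronger agent
```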

5. Human Feedback: Outcome and Process-level Metrics

To enable comprehensive agent assessment, the platform employs dual feedback mechanisms:

Outcome-based ratings involve human annotators comparing the final reports of two agents and voting for the better one (or marking a tie, or marking both as bad), allowing for robust leaderboard construction and overall ranking.

Fine-grained feedback is collected by allowing annotators to:

  • Upvote/downvote individual intermediate steps in the agent’s reasoning trace,
  • Highlight and annotate specific spans of the final report with positive or negative feedback.

These annotations yield process-oriented metrics such as step upvote rates (fraction of positively rated steps per agent), offering insight into the reliability and transparency of the agent’s research methodology. As demonstrated in benchmarking experiments, this dual granularity is essential: agents that deliver strong final reports but weak underlying reasoning (e.g., “abstruse” or opaque steps) are penalized in process-based metrics, guiding targeted improvements in agent design (Chandrahasan et al., 7 Jul 2025).
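As a concrete illustration of the process metric, a step upvote rate can be computed directly from (agent, vote) annotation pairs; the data below is hypothetical.

```python
from collections import defaultdict

def step_upvote_rates(annotations):
    """Fraction of positively rated steps per agent, from (agent_id, vote) pairs with vote in {+1, -1}."""
    up, total = defaultdict(int), defaultdict(int)
    for agent_id, vote in annotations:
        total[agent_id] += 1
        if vote > 0:
            up[agent_id] += 1
    return {agent: up[agent] / total[agent] for agent in total}

# Hypothetical step-level votes collected during a comparison session.
votes = [("agent_a", +1), ("agent_a", -1), ("agent_a", +1), ("agent_b", +1), ("agent_b", -1)]
print(step_upvote_rates(votes))   # {'agent_a': 0.666..., 'agent_b': 0.5}
```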

6. Benchmarking and Case Study Results

An evaluation involving 176 real user queries compared three agents: Perplexity DeepResearch (proprietary), GPT Researcher (open-source), and Simple Deepresearch (Gemini 2.5 Flash). Annotators cast pairwise votes on reports and provided 1,281 step-wise and 593 span-wise annotations. Results revealed:

  • The outcome-based ranking (via Bradley–Terry) identified the preferred agents overall.
  • Process metrics highlighted cases where even high-ranking final reports masked less reliable or less interpretable reasoning chains.
  • The integrated feedback loop enables fine-tuning of prompt structures and agent process design; this can inform reinforcement learning from human feedback (RLHF) and scaffold future iterative development.

7. Technical and Practical Significance

Simple Deepresearch exemplifies a minimalist yet structurally robust approach to LLM-based deep research agent design. Its contributions include:

  • A modular, prompt-driven, and easily extensible framework supporting rapid benchmarking and integration of new LLMs,
  • Explicit, formal process tracking and updates for reproducibility and interpretability,
  • Seamless integration into process- and outcome-focused evaluation platforms, supporting both academic research and industrial deployment,
  • The facility to gather multifaceted human feedback at every stage, offering actionable signals for procedural and output quality improvement.

The framework has established itself as a reference baseline for evaluating deep research agents and provides a foundation for future research in both algorithmic workflow design and human-aligned evaluation (Chandrahasan et al., 7 Jul 2025).

References (1)