Deep Research Comparator Platform
- Deep Research Comparator is an evaluation platform that compares deep research agents using transparent process and outcome metrics.
- It integrates modular architecture with a user-friendly interface for side-by-side report reviews and granular human annotation.
- The platform employs dual ranking methodologies, combining Bradley–Terry (BT) outcome scores with process-level step upvote rates, to diagnose and enhance agent reasoning.
A deep research comparator is an evaluation, annotation, and ranking platform designed to rigorously compare the outputs and intermediate steps of deep research agents—autonomous systems that search, synthesize, and generate comprehensive reports in response to complex research queries. The platform integrates end-to-end agent hosting, process-level and outcome-level human feedback, and systematized ranking methodologies to support fine-grained, transparent, and actionable evaluation of open-domain research agents (Chandrahasan et al., 7 Jul 2025).
1. Architectural Framework and System Design
The deep research comparator is structured into three principal tiers: a user-facing static web interface, a main backend service, and an agent serving service. The frontend allows users to submit research queries, view final reports from two different agents displayed side-by-side, and observe each agent’s intermediate generation steps in real time. The main backend service orchestrates query routing, receives agent outputs as streamed JSON responses, and logs user interaction and annotation data for subsequent ranking computation. Each deep research agent runs within its own isolated environment (typically Dockerized), streams both intermediate outputs and final reports, and can be implemented by wrapping any LLM within the provided agent scaffold.
This modular system, summarized in the following table, enables reproducibility and extensibility:
| Component | Function | Example Interface Element |
|---|---|---|
| Frontend | Side-by-side report display, feedback capture | Web interface with annotation tools |
| Main Backend Service | Query routing, annotation storage, ranking | Aggregator and preference tracker |
| Agent Serving | Agent execution and step streaming | Docker containers with streaming |
System diagrams (see (Chandrahasan et al., 7 Jul 2025), Fig. 2) depict the query flow from user input to multi-agent output, fine-grained annotation, and ranking module.
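To make the streaming contract concrete, the sketch below shows one plausible shape for an intermediate-step payload relayed from an agent container to the backend as newline-delimited JSON. The field names (`agent_id`, `step_index`, `thought`, `action`, `content`, `timestamp`) are illustrative assumptions, not the platform's actual schema.

```python
import json
import time
from dataclasses import dataclass, asdict
from typing import Iterator

@dataclass
class AgentStep:
    """One intermediate step streamed from an agent container (illustrative schema)."""
    agent_id: str      # which agent produced the step (hypothetical field)
    step_index: int    # position in the agent's reasoning trace
    thought: str       # the model's free-text "thought"
    action: str        # one of: plan, search, script, summary, answer
    content: str       # action payload, e.g. a search query or a report section
    timestamp: float   # wall-clock time, used for ordering in the UI

def stream_steps(steps: Iterator[AgentStep]) -> Iterator[str]:
    """Serialize steps as newline-delimited JSON, the form a backend could relay to the frontend."""
    for step in steps:
        yield json.dumps(asdict(step)) + "\n"

if __name__ == "__main__":
    demo = [
        AgentStep("agent_a", 0, "I should outline the report first.", "plan",
                  "1) background 2) findings 3) synthesis", time.time()),
        AgentStep("agent_a", 1, "I need recent sources on the topic.", "search",
                  "recent survey of deep research agents", time.time()),
    ]
    for line in stream_steps(iter(demo)):
        print(line, end="")
```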
2. Fine-grained Human Evaluation and Feedback Modalities
A core innovation in the deep research comparator is its support for multi-level human annotation. Two principal evaluation modalities are implemented:
- Side-by-Side Report Voting: Annotators review full research reports from two agents, cast comparative judgments via discrete options (“Agent A is better,” “Agent B is better,” “Tie,” “Both are bad”), and thus assess overall outcome quality.
- Process-level Fine-grained Feedback: Annotators examine each agent’s sequence of intermediate steps—each consisting of a model “thought” and an action (e.g., plan, web search, evidence synthesis, summary, or answer)—and can upvote or downvote individual steps for relevance, correctness, or informativeness. In the final report output, users may further highlight specific text spans for targeted feedback.
These modalities allow for both high-level outcome ranking and diagnostic, step-level feedback, informing improvements to reasoning strategies, evidence integration, and answer synthesis.
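A minimal sketch of how such annotation records might be represented is shown below; the enum values mirror the four voting options above, while the field names and record layout are assumptions for illustration rather than the platform's actual data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ReportVote(Enum):
    """Side-by-side outcome judgment options offered to annotators."""
    A_BETTER = "agent_a_is_better"
    B_BETTER = "agent_b_is_better"
    TIE = "tie"
    BOTH_BAD = "both_are_bad"

@dataclass
class StepFeedback:
    """Fine-grained feedback on one intermediate step (illustrative record layout)."""
    query_id: str
    agent_id: str
    step_index: int
    vote: int            # +1 for an upvote, -1 for a downvote

@dataclass
class SpanHighlight:
    """A text span in the final report highlighted by the annotator."""
    query_id: str
    agent_id: str
    start_char: int      # character offsets into the report text
    end_char: int
    comment: Optional[str] = None
```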
3. Ranking Methodologies and Metrics
Agent quality is quantified through two complementary ranking metrics:
- Outcome-based Ranking: Pairwise side-by-side votes are aggregated using the Bradley–Terry (BT) model, yielding a continuous quality score for each agent. The comparator establishes a baseline (e.g., Simple Deepresearch at a BT score of 1000), with other agents’ BT scores computed relative to this anchor, enabling robust relative ranking of competing agent systems (a minimal fitting sketch follows this list).
- Process-based Metrics: The primary step-level measure is the upvote rate, defined as the fraction of an agent’s annotated intermediate steps that received an upvote, $\text{upvote rate} = \frac{\#\text{upvotes}}{\#\text{upvotes} + \#\text{downvotes}}$.
Both outcome- and process-based metrics are recalculated periodically as new feedback is ingested, providing dual insight into the quality of final reports and the procedural rigor of agent stepwise reasoning.
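The sketch below fits Bradley–Terry strengths from pairwise votes using the classic minorization–maximization (Zermelo) updates and rescales them so the baseline agent sits at 1000. The Elo-like scaling and the handling of ties (dropped before fitting) are assumptions about details not specified here.

```python
import math
from collections import defaultdict

def fit_bradley_terry(comparisons, iters=200):
    """Fit Bradley-Terry strengths with MM (Zermelo) updates.

    `comparisons` is a list of (winner, loser) pairs from side-by-side votes;
    ties and "both are bad" votes are assumed to be dropped beforehand.
    """
    agents = {a for pair in comparisons for a in pair}
    wins = defaultdict(float)          # W_i: total wins per agent
    n = defaultdict(float)             # n_ij: number of comparisons per pair
    for winner, loser in comparisons:
        wins[winner] += 1.0
        n[frozenset((winner, loser))] += 1.0

    p = {a: 1.0 for a in agents}       # initial strengths
    for _ in range(iters):
        new_p = {}
        for i in agents:
            denom = sum(
                n[frozenset((i, j))] / (p[i] + p[j])
                for j in agents if j != i and n[frozenset((i, j))] > 0
            )
            new_p[i] = wins[i] / denom if denom > 0 else p[i]
        norm = sum(new_p.values())     # normalize to keep the scale stable
        p = {a: v / norm for a, v in new_p.items()}
    return p

def to_anchored_scores(p, baseline, anchor=1000.0, scale=400.0):
    """Map strengths to an Elo-like scale with the baseline agent fixed at `anchor`."""
    return {a: anchor + scale * math.log10(v / p[baseline]) for a, v in p.items()}

# Toy example with made-up votes (not the paper's data).
votes = [("gpt_researcher", "simple_deepresearch"),
         ("gpt_researcher", "perplexity"),
         ("perplexity", "simple_deepresearch"),
         ("simple_deepresearch", "perplexity")]
strengths = fit_bradley_terry(votes)
print(to_anchored_scores(strengths, baseline="simple_deepresearch"))
```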
4. Agent Scaffold and Integration: Simple Deepresearch
The platform introduces Simple Deepresearch, a standardized, prompt-based agent scaffold supporting integration of any LLM as the underlying policy. The agent executes an iterative loop:
- Receives the user query $q$ and the current history $H_t$.
- Produces a thought $\tau_t$ and selects an action $a_t$ from a predefined action space (plan, search, script, summary, answer).
- Updates its memory according to formal rules, e.g.:
  - For a search action, the thought, the action, and the retrieved results $o_t$ are appended: $H_{t+1} = H_t \cup \{\tau_t, a_t, o_t\}$.
  - For a plan or script action, the thought and the generated output are appended: $H_{t+1} = H_t \cup \{\tau_t, a_t\}$.
  - For a summary action (context condensing), the accumulated history is replaced by its condensed summary: $H_{t+1} = \mathrm{summarize}(H_t)$.
This scaffold ensures agent processes are standardized, transparent, and easily connected to the platform’s comparison infrastructure, reducing engineering overhead for evaluation of new models.
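The following sketch illustrates the kind of iterative thought/action loop described above; the `llm` and `web_search` callables, the message layout, and the termination rule are placeholders assumed for illustration, not the scaffold's actual implementation.

```python
from typing import Callable, Dict, List

ACTIONS = ("plan", "search", "script", "summary", "answer")

def simple_deepresearch(query: str,
                        llm: Callable[[List[Dict[str, str]]], Dict[str, str]],
                        web_search: Callable[[str], str],
                        max_steps: int = 20) -> str:
    """Minimal sketch of an iterative thought/action scaffold.

    `llm` is a placeholder policy mapping the current history to a dict with
    "thought", "action", and "content" keys; `web_search` stands in for an
    actual retrieval backend. Both are assumptions, not the platform's API.
    """
    history: List[Dict[str, str]] = [{"role": "user", "content": query}]
    for _ in range(max_steps):
        step = llm(history)                       # produce a thought and choose an action
        thought, action, content = step["thought"], step["action"], step["content"]
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        if action == "answer":                    # final report: terminate the loop
            return content
        if action == "search":                    # append thought, action, and retrieved results
            results = web_search(content)
            history += [{"role": "assistant", "content": f"{thought}\n[search] {content}"},
                        {"role": "tool", "content": results}]
        elif action == "summary":                 # condense the context into a single entry
            history = [{"role": "user", "content": query},
                       {"role": "assistant", "content": f"[summary] {content}"}]
        else:                                     # plan / script: append thought and output
            history.append({"role": "assistant", "content": f"{thought}\n[{action}] {content}"})
    return "No final answer produced within the step budget."
```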
5. Human Preference Annotation Experiments and Insights
Empirical evaluation with real annotators demonstrates the platform’s effectiveness. In the reported deployment, 17 annotators assessed three deep research agents over 176 queries. Results showed:
- Side-by-side voting, when aggregated via the BT model, established a relative quality ranking: GPT Researcher received the highest BT score, followed by Perplexity DeepResearch, with the Simple Deepresearch baseline (Gemini 2.5 Flash as its underlying LLM) serving as the anchor.
- Step upvote rates revealed that even when agents achieved comparable overall report quality, process-level feedback could distinguish differences in reasoning transparency and intermediate step effectiveness.
- Annotators valued the ability to drill into specific actions or text spans to highlight strong or weak reasoning elements—a feature particularly relevant for debugging complex multi-hop research behavior.
6. Broader Implications and Platform Utility
The deep research comparator framework directly addresses emerging challenges in rigorous evaluation of open-domain, multi-step research agents. By synthesizing outcome and process feedback, it supports:
- Transparent, reproducible benchmarking of agentic LLMs across versions and implementations.
- Diagnosis and targeted enhancement of agent reasoning.
- The creation of large fine-grained datasets of human feedback, useful for training or fine-tuning more robust agentic models.
The modular design and agent scaffold facilitate swift integration of new research systems, fostering rapid iteration and comparison across diverse benchmarks and domains.
7. Visualization and Demonstration
A public demo video (https://www.youtube.com/watch?v=g4d2dnbdseg) showcases the platform’s main features, user interface, step visualization, and annotation workflow. The real-time interface demonstrates practical usability for both annotators and research developers, reinforcing the system’s value as both a research comparator and a tool for agent improvement.
The deep research comparator platform represents a significant step in the systematic evaluation and development of deep research agents, enabling side-by-side, fine-grained, and process-aware comparison grounded in human judgment (Chandrahasan et al., 7 Jul 2025).