MiroThinker-H1: Verified Deep Research Agent

Updated 4 July 2026

MiroThinker-H1 is a verification-centric research agent that combines deep reasoning with both local and global verification for robust multi-step problem solving.
It employs a dual-loop ReAct framework that iteratively plans, interacts with tools, and manages context to refine evidence synthesis during long-horizon tasks.
Benchmark evaluations demonstrate that MiroThinker-H1 outperforms competitors in open-web research, scientific reasoning, and multimodal analysis with improved efficiency and accuracy.

MiroThinker-H1 is a verification-centric deep research agent introduced as the heavy-duty reasoning mode built on top of MiroThinker-1.7 for complex long-horizon tasks involving planning, web search, evidence gathering, tool use, intermediate synthesis, and final report generation (Team et al., 16 Mar 2026). In the associated evaluation literature, it is characterized not merely as a model checkpoint but as an end-to-end research system whose defining feature is the integration of verification into inference at both local and global levels, with the aim of making multi-step problem solving more reliable under extended agent-environment interaction (Team et al., 16 Mar 2026). Across benchmark settings that include open-web research, scientific reasoning, financial analysis, and multimodal deep research, MiroThinker-H1 is reported as the top-ranked overall system in both text-only and multimodal settings in MiroEval, while the broader MiroThinker family is presented as an open-source research-agent line emphasizing model, context, and interaction scaling (Ye et al., 30 Mar 2026, Team et al., 14 Nov 2025).

1. System identity and family context

MiroThinker-H1 is introduced in "MiroThinker-1.7 & H1: Towards Heavy-Duty Research Agents via Verification" as the flagship system built from MiroThinker-1.7 and extended with heavy-duty reasoning capabilities for more reliable multi-step problem solving (Team et al., 16 Mar 2026). The paper draws a clear distinction between the underlying agent foundation and the H1 mode: MiroThinker-1.7 improves the reliability of individual interaction steps through agentic mid-training, whereas MiroThinker-H1 incorporates verification directly into inference at both local and global levels (Team et al., 16 Mar 2026).

This positioning is important because the paper treats H1 less as a separately specified pretrained architecture than as a verification-augmented reasoning configuration over the MiroThinker-1.7 agent substrate (Team et al., 16 Mar 2026). The open-source release statement in that paper applies explicitly to MiroThinker-1.7 and MiroThinker-1.7-mini, not to H1 itself (Team et al., 16 Mar 2026). A plausible implication is that H1 should be understood operationally as a system-level deployment mode rather than simply a base model name.

The broader family context is supplied by "MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling," which describes MiroThinker v1.0 as an open-source research agent family in 8B, 30B, and 72B sizes, trained to support deep tool-augmented reasoning with a 256K context window and up to 600 tool calls per task (Team et al., 14 Nov 2025). That paper does not explicitly mention a variant called MiroThinker-H1, and any direct mapping from v1.0 to H1 is therefore inferential rather than textual (Team et al., 14 Nov 2025). What it does establish is the family design philosophy: model scaling, context scaling, and especially interaction scaling as a third capability axis for research agents (Team et al., 14 Nov 2025).

2. Agent framework and interaction model

The agentic substrate inherited by MiroThinker-H1 is a single-agent ReAct-style framework with a dual-loop structure: an inner step loop for reasoning and tool interaction, and an outer episode loop for retry-based restarts (Team et al., 16 Mar 2026). At step $t$ in episode $e$, the trajectory is represented as

$H_t^{(e)} = \bigl\{(T_1, A_1, O_1), \ldots, (T_{t-1}, A_{t-1}, O_{t-1})\bigr\},$

where $T_i$ denotes a thought, $A_i$ an action, and $O_i$ the resulting observation (Team et al., 16 Mar 2026). Rather than reasoning over the full raw history, the system applies a context operator that keeps only recent observations. The recent-step index set is

$S_t(K) = \bigl\{ i \in \{1,\ldots,t-1\} \mid i \ge t-K \bigr\},$

and observations are mapped by

$\Phi_t(O_i) = \begin{cases} \mathrm{Trunc}_L(O_i), & i \in S_t(K), \\ \varnothing, & \text{otherwise}, \end{cases}$

giving the effective context

$C_t^{(e)} = \bigl\{\bigl(T_i, A_i, \Phi_t(O_i)\bigr)\bigr\}_{i=1}^{t-1}.$

Thought and action generation are then written as

$T_t = f_\theta(q, C_t^{(e)}), \qquad A_t = \pi_\theta(C_t^{(e)}, T_t),$

followed by environment execution,

$O_t = \mathrm{Env}(A_t),$

and trajectory extension,

$H_{t+1}^{(e)} = H_t^{(e)} \cup \bigl\{(T_t, A_t, O_t)\bigr\}.$

All of these equations are given in the H1 system paper (Team et al., 16 Mar 2026).
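
The recency-based context operator $\Phi_t$ defined above can be illustrated with a minimal Python sketch. It assumes observations are plain strings and that $\mathrm{Trunc}_L$ keeps the first $L$ characters; the text does not state the truncation unit, and the name `effective_context` is hypothetical rather than taken from the paper.

```python
from typing import List, Optional, Tuple

# One trajectory step: (thought, action, observation).
Step = Tuple[str, str, Optional[str]]


def effective_context(history: List[Step], K: int, L: int) -> List[Step]:
    """Keep every thought and action, but only the last K observations.

    `history` holds steps 1..t-1, so the current step index is t = len(history) + 1.
    Retained observations are truncated to L characters (assumed unit).
    """
    t = len(history) + 1
    context: List[Step] = []
    for i, (thought, action, observation) in enumerate(history, start=1):
        in_recent_window = i >= t - K  # membership in S_t(K)
        if in_recent_window and observation is not None:
            kept: Optional[str] = observation[:L]
        else:
            kept = None  # plays the role of the empty observation
        context.append((thought, action, kept))
    return context
```

Thoughts and actions are never dropped in this sketch, which matches the definition of $C_t^{(e)}$: only the observation slot is filtered.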

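The outer episode loop resets the context to the original query when an episode exhausts its turn budget without producing a valid answer, so that the next episode starts from an empty history and conditions on $q$ alone:

$H_1^{(e+1)} = \varnothing.$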

This restart policy is explicitly motivated as a way to prevent stale or polluted trajectories from consuming the remaining budget (Team et al., 16 Mar 2026).
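
The dual-loop control flow can be sketched as follows. This is an illustrative outline under stated assumptions, not the paper's implementation: the callables `think`, `act`, `execute`, and `is_final_answer`, and the budgets `max_turns` and `max_episodes`, are hypothetical stand-ins for $f_\theta$, $\pi_\theta$, the tool environment, and the budgets. For brevity the raw history is passed to the model without the context operator.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str, str]  # (thought, action, observation)


def run_dual_loop(
    query: str,
    think: Callable[[str, List[Step]], str],
    act: Callable[[List[Step], str], str],
    execute: Callable[[str], str],
    is_final_answer: Callable[[str], bool],
    max_turns: int,
    max_episodes: int,
) -> Optional[str]:
    """Run up to max_episodes episodes of at most max_turns steps each."""
    for _episode in range(max_episodes):
        # Outer loop: restart with an empty trajectory; only the query survives.
        history: List[Step] = []
        for _turn in range(max_turns):
            # Inner loop: thought, then action, then observation.
            thought = think(query, history)
            action = act(history, thought)
            if is_final_answer(action):
                return action
            observation = execute(action)
            history.append((thought, action, observation))
    return None  # every episode exhausted its turn budget
```

Resetting `history` at the top of each episode is what keeps a stale trajectory from consuming the remaining budget.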

The tool environment includes structured search via google_search, page-level extraction via scrape_and_extract_info, and an E2B Linux sandbox supporting create_sandbox, run_command, and run_python_code, together with file-transfer utilities (Team et al., 16 Mar 2026). The broader MiroThinker family paper describes an analogous tool stack and emphasizes that such tooling is central to deep research workflows, including retrieval, scraping, code execution, and context management (Team et al., 14 Nov 2025).

3. Training foundation in MiroThinker-1.7

MiroThinker-H1 depends on the training advances of MiroThinker-1.7, which the paper describes as a four-stage pipeline: agentic mid-training, supervised fine-tuning, preference optimization, and reinforcement learning (Team et al., 16 Mar 2026). The role of this pipeline is to strengthen atomic agentic capabilities such as planning, contextual reasoning, tool use, and summarization before verification is added at inference time (Team et al., 16 Mar 2026).

The agentic mid-training stage is designed around single-turn planning data, interleaved reasoning data from partial trajectories, and intermediate summarization data, and the paper defines the stage's training objective over these three data types (Team et al., 16 Mar 2026).

This stage includes an "Agentic Planning Boosting" procedure in which the model receives only the user query and must produce a structured plan plus the first tool call, as well as "Agentic Reasoning and Summarization Sculpting," where a selected trajectory step is rewritten into a better reasoning or summarization target (Team et al., 16 Mar 2026). The paper reports that MiroThinker-1.7 improves over MiroThinker-1.5 by 16.7% while using about 43.0% fewer interaction rounds, and on HLE by 17.4% with 61.6% fewer rounds (Team et al., 16 Mar 2026).

The SFT stage trains on a dataset of expert trajectories with a supervised loss defined over those trajectories. The preference-optimization stage uses pairs of preferred and dispreferred trajectories, with a DPO-based objective and an auxiliary SFT term on preferred trajectories (Team et al., 16 Mar 2026).
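
A DPO-based objective with an auxiliary SFT term is commonly written as the standard DPO loss plus a weighted negative log-likelihood on the preferred trajectory. The sketch below shows that generic form for a single preference pair. It is not the paper's exact objective: the function name, the values of `beta` and `sft_weight`, and the use of whole-trajectory log-probabilities are all assumptions.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_loss(
    logp_pos: float,      # log pi_theta(preferred trajectory | query)
    logp_neg: float,      # log pi_theta(dispreferred trajectory | query)
    ref_logp_pos: float,  # same quantities under the frozen reference policy
    ref_logp_neg: float,
    beta: float = 0.1,        # assumed DPO temperature
    sft_weight: float = 1.0,  # assumed weight of the auxiliary SFT term
) -> float:
    """Standard DPO loss plus a weighted SFT term on the preferred trajectory."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    dpo_term = softplus(-margin)  # equals -log(sigmoid(margin))
    sft_term = -logp_pos          # negative log-likelihood of the preferred trajectory
    return dpo_term + sft_weight * sft_term
```

When the policy equals the reference, the margin is zero and the DPO term is $\log 2$; raising the preferred trajectory's log-probability relative to the dispreferred one lowers it.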

The final RL stage uses Group Relative Policy Optimization (GRPO). The paper defines a trajectory-level reward, a group-relative advantage derived from that reward, and the GRPO objective, and it additionally specifies a dynamic KL coefficient (Team et al., 16 Mar 2026).
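
The idea of a group-relative advantage can be illustrated with a short sketch: several trajectories are sampled for the same query, and each trajectory's reward is compared with the group mean. Whether the paper also divides by the group standard deviation is not stated in this text, so the `normalize` flag, the `eps` constant, and the function name are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(
    rewards: List[float],
    normalize: bool = True,
    eps: float = 1e-8,
) -> List[float]:
    """Center rewards on the group mean; optionally scale by the group std."""
    group_mean = mean(rewards)
    centered = [r - group_mean for r in rewards]
    if not normalize:
        return centered
    group_std = pstdev(rewards)
    # eps avoids division by zero when every trajectory gets the same reward.
    return [c / (group_std + eps) for c in centered]
```

A group in which all trajectories receive the same reward yields zero advantage everywhere, so such a group contributes no policy-gradient signal.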

These formal components are important because H1’s verification layer presupposes a base agent already trained to generate competent plans, searches, tool interactions, and intermediate syntheses (Team et al., 16 Mar 2026).

4. Verification-centric heavy-duty reasoning

The defining addition in MiroThinker-H1 is the incorporation of verification into inference at both local and global levels (Team et al., 16 Mar 2026). The paper motivates this by arguing that longer trajectories alone do not guarantee better performance: in hard open-ended tasks, an early error may instead be amplified across subsequent steps (Team et al., 16 Mar 2026).

Local verification operates on intermediate reasoning decisions. The paper states that it evaluates and refines planning decisions, tool invocations, and hypothesis updates during inference, encouraging exploration beyond the default highest-probability path and correcting errors early in the trajectory (Team et al., 16 Mar 2026). No formal verifier score, architecture, or threshold is given in the text. The contribution is described functionally rather than as a mathematically specified module.

Global verification operates over the full reasoning trajectory. Its purpose is to organize the evidence chain, determine whether that chain sufficiently supports the answer, compare candidate solution paths, and request resampling or completion if support is insufficient (Team et al., 16 Mar 2026). The paper frames this as exploiting the asymmetry that verification is often easier than generation: the system may find it easier to judge whether a candidate answer is backed by coherent evidence than to generate the best reasoning path directly (Team et al., 16 Mar 2026).
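
The functional description of global verification (judge whether the evidence chain supports the answer, compare candidates, resample when support is insufficient) can be illustrated with a hypothetical control loop. Since the paper gives no verifier score or threshold, everything here is an assumption made for illustration: `sample_candidate`, `support_score`, `threshold`, and `max_samples` are invented names, and a scalar score is only one possible way to express "sufficient support".

```python
from typing import Callable, List, Optional, Tuple

Candidate = Tuple[str, List[str]]  # (answer, evidence chain)


def select_verified_answer(
    sample_candidate: Callable[[], Candidate],
    support_score: Callable[[Candidate], float],
    threshold: float,
    max_samples: int,
) -> Tuple[Optional[Candidate], float]:
    """Resample until some candidate's evidence is judged sufficient.

    Returns the best-supported candidate seen and its score, even if no
    candidate reached the threshold within the sampling budget.
    """
    best: Optional[Candidate] = None
    best_score = float("-inf")
    for _ in range(max_samples):
        candidate = sample_candidate()
        score = support_score(candidate)
        if score > best_score:
            best, best_score = candidate, score
        if best_score >= threshold:
            break  # evidence judged sufficient; stop resampling
    return best, best_score
```

The sketch relies on the asymmetry the paper points to: `support_score` only has to judge a finished candidate, which may be easier than generating the best trajectory directly.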

The H1 paper reports a concrete hard-subset ablation on BrowseComp. On a subset of 295 questions where MiroThinker-1.7 often fails, MiroThinker-1.7 achieves Pass@1 of 32.1 with 1185.2 steps, while "MiroThinker-H1 w/ Local Verifier Only" achieves Pass@1 of 58.5 with 210.8 steps (Team et al., 16 Mar 2026). This shows that local verification is associated not only with higher accuracy but also with sharply fewer steps. A plausible interpretation is that the verifier improves search efficiency by pruning unproductive branches earlier.
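
The size of the ablation effect follows directly from the four reported numbers. The short calculation below uses only those figures; the variable names are illustrative.

```python
# Reported hard-subset BrowseComp figures (295 questions).
baseline = {"pass_at_1": 32.1, "steps": 1185.2}    # MiroThinker-1.7
local_only = {"pass_at_1": 58.5, "steps": 210.8}   # H1 with local verifier only

accuracy_gain = local_only["pass_at_1"] - baseline["pass_at_1"]
step_reduction = 1.0 - local_only["steps"] / baseline["steps"]
step_ratio = baseline["steps"] / local_only["steps"]

print(f"Pass@1 gain: {accuracy_gain:.1f} points")      # about 26.4 points
print(f"Step reduction: {step_reduction * 100:.1f}%")  # about 82.2%
print(f"Baseline uses {step_ratio:.1f}x the steps")    # about 5.6x
```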

The same paper also reports a compute-scaling result for H1 on BrowseComp: at 16× compute, the default benchmark budget, accuracy reaches 85.9, and at 64× compute it improves to 88.2 (Team et al., 16 Mar 2026). This suggests that H1 is designed as a heavy-duty mode whose verification procedures can exploit increased inference budget. The paper describes this as log-linear scaling of accuracy with compute (Team et al., 16 Mar 2026).
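
Under a log-linear reading, the two reported points imply a fixed accuracy gain per doubling of compute. The sketch below computes that implied slope. Two points always define a line, so this does not test the log-linear claim; it only restates the reported numbers in per-doubling terms. The function name is illustrative.

```python
import math


def gain_per_doubling(compute_a: float, acc_a: float, compute_b: float, acc_b: float) -> float:
    """Accuracy points gained per doubling of compute, assuming log-linear scaling."""
    return (acc_b - acc_a) / math.log2(compute_b / compute_a)


# Reported BrowseComp points: 85.9 at 16x compute, 88.2 at 64x compute.
slope = gain_per_doubling(16, 85.9, 64, 88.2)
print(f"{slope:.2f} points per doubling")  # about 1.15
```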

5. Benchmark performance and evaluation standing

The principal benchmark claims for MiroThinker-H1 come from two sources: the H1 system paper and the MiroEval benchmark paper (Team et al., 16 Mar 2026, Ye et al., 30 Mar 2026). In the former, H1 is reported to achieve 88.2 on BrowseComp, 84.4 on BrowseComp-ZH, 47.7 on HLE, 88.5 on GAIA, 72.0 on xbench-DeepSearch-2510, 61.3 on SEAL-0, and 80.6 on DeepSearchQA (Team et al., 16 Mar 2026). The same paper states that H1 surpasses Gemini-3.1-Pro and Claude-4.6-Opus on BrowseComp, Seed-2.0-Pro on BrowseComp-ZH, OpenAI-GPT-5 on GAIA by 12.1 points, and achieves the best score among listed models on SEAL-0 (Team et al., 16 Mar 2026).

The paper also reports specialized-domain results: 79.0 on FrontierSci-Olympiad, 51.3 on SUPERChem (text only), 73.9 on FinSearchComp, and 56.5 on MedBrowseComp (Team et al., 16 Mar 2026). It is said to be best on FrontierSci-Olympiad, FinSearchComp, and MedBrowseComp, while remaining competitive but not best on SUPERChem (Team et al., 16 Mar 2026).

On long-report evaluation using a 50-query deep research benchmark built with DeepResearchEval, H1 obtains Report Quality 76.8, Factuality 79.1, and Overall 78.0 (Team et al., 16 Mar 2026). The paper notes that this is the highest Report Quality among the listed agents, although not the highest Factuality (Team et al., 16 Mar 2026).

MiroEval situates H1 in a broader diagnostic framework for deep research systems. That benchmark comprises 100 tasks, including 70 text-only and 30 multimodal tasks, and evaluates systems along adaptive synthesis quality, agentic factuality, and process-centric dimensions (Ye et al., 30 Mar 2026). In that evaluation, MiroThinker-H1 ranks first overall in both settings. Its text-only scores are Synthesis 76.7, Factuality 81.1, Process 74.7, and Overall 77.5. Its multimodal scores are Synthesis 71.5, Factuality 78.5, Process 73.5, and Overall 74.5 (Ye et al., 30 Mar 2026).
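
The reported Overall values are numerically consistent with an unweighted mean of the three dimension scores. The text here does not state the aggregation rule, so the check below is an observation about the numbers and not a statement of MiroEval's definition.

```python
def unweighted_overall(synthesis: float, factuality: float, process: float) -> float:
    """Plain average of the three dimension scores (assumed aggregation)."""
    return (synthesis + factuality + process) / 3.0


text_only = unweighted_overall(76.7, 81.1, 74.7)
multimodal = unweighted_overall(71.5, 78.5, 73.5)

print(f"text-only: {text_only:.1f}")    # reported Overall: 77.5
print(f"multimodal: {multimodal:.1f}")  # reported Overall: 74.5
```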

The MiroEval paper emphasizes that H1 is distinguished by balanced performance rather than dominance on only one metric (Ye et al., 30 Mar 2026). In text-only evaluation, it produces 3746 correct claims and 161 wrong claims, which the paper identifies as the lowest absolute wrong-claim count of any system while maintaining high claim volume (Ye et al., 30 Mar 2026). In multimodal evaluation, it records 1316 correct claims, 82 wrong claims, 56 conflict labels, and 238 unknown labels, together with the best multimodal factuality ratio at 78.5 (Ye et al., 30 Mar 2026).

Its process scores are especially notable. In text-only evaluation, H1 attains Breadth 74.9, Depth 64.9, Refinement 72.2, Critical Thinking 69.1, Efficiency 71.0, P$\rightarrow$R 87.0, R$\rightarrow$P 63.3, Contradiction Detection 86.4, and Process Overall 74.7 (Ye et al., 30 Mar 2026). In multimodal evaluation, it attains Breadth 68.6, Depth 63.1, Refinement 73.4, Critical Thinking 71.0, Efficiency 64.1, P$\rightarrow$R 86.6, R$\rightarrow$P 63.4, Contradiction Detection 86.9, and Process Overall 73.5 (Ye et al., 30 Mar 2026). The same study reports a Pearson correlation of 0.88 between Process and combined outcome score, using this to argue that strong process quality is a reliable predictor of strong overall performance (Ye et al., 30 Mar 2026).

6. Interpretation, robustness, and limitations

The central interpretive claim across the H1 literature is that MiroThinker-H1 represents a shift from raw interaction depth to verified long-horizon reasoning (Team et al., 16 Mar 2026). The system paper frames this as "effective interaction scaling": stronger step-level competence from MiroThinker-1.7 plus local and global verification in H1 (Team et al., 16 Mar 2026). The MiroEval paper independently converges on a related conclusion, arguing that H1’s top rank derives from consistent strength across synthesis, factuality, and process, particularly process-report alignment and contradiction handling (Ye et al., 30 Mar 2026).

MiroEval also provides several robustness checks. On multimodal tasks, three reruns yield H1 overall scores of 74.5, 73.9, and 73.8, for a mean overall of 74.1 and standard deviation 0.3 (Ye et al., 30 Mar 2026). Under a different judge family, its rank remains unchanged ($\Delta\text{Rank} = 0$), and under a modified prompt its overall score changes only slightly, again with no rank change (Ye et al., 30 Mar 2026). In a human ranking study, H1 obtains average human rank 1.8 while MiroEval assigns it rank 1; the paper reports overall human-vs-MiroEval agreement using Kendall's $\tau$ and Spearman's $\rho$ (Ye et al., 30 Mar 2026).
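
The rerun statistics can be reproduced from the three reported scores. The check below uses Python's standard `statistics` module; it suggests that the quoted standard deviation of 0.3 corresponds to the population form, since the sample form rounds to 0.4. That reading is an inference from the numbers, not something the text states.

```python
from statistics import mean, pstdev, stdev

runs = [74.5, 73.9, 73.8]  # reported multimodal overall scores across three reruns

run_mean = mean(runs)
population_sd = pstdev(runs)  # divides by n
sample_sd = stdev(runs)       # divides by n - 1

print(f"mean = {run_mean:.1f}")                # 74.1
print(f"population sd = {population_sd:.1f}")  # 0.3
print(f"sample sd = {sample_sd:.1f}")          # 0.4
```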

Several limitations are also explicit. The MiroEval paper notes that process evaluation relies on systems exposing intermediate reasoning traces, which limits applicability to fully closed-source systems without such access (Ye et al., 30 Mar 2026). It also stresses that the benchmark evaluates deployed systems or products in a live, evolving setting rather than frozen model checkpoints, so rankings are time-bound and may be affected by backend changes, prompt-stack changes, tool availability, and web freshness (Ye et al., 30 Mar 2026). The factuality framework includes a CONFLICT label but does not resolve which conflicting source is correct (Ye et al., 30 Mar 2026).

The H1 system paper presents a different set of limits through omission. It gives no formal mathematical specification of the local or global verifier, no explicit verifier architecture, no scoring formula for evidence coherence, and no pseudocode for verifier-driven reranking (Team et al., 16 Mar 2026). Thus the existence and effectiveness of verification are well supported at the systems level, but the internal mechanism remains only partially specified in the published account (Team et al., 16 Mar 2026). It also does not state that H1 weights are released, whereas MiroThinker-1.7 and MiroThinker-1.7-mini are explicitly released as open-source models (Team et al., 16 Mar 2026).

Taken together, the literature presents MiroThinker-H1 as a verification-augmented deep research system whose technical significance lies in combining a trained research-agent foundation with inference-time auditing of intermediate decisions and full evidence chains (Team et al., 16 Mar 2026). Within the evaluation ecosystem built around deep research agents, it is characterized as the most balanced system in both text-only and multimodal settings, with especially strong process metrics and robust ranking stability under judge and prompt variation (Ye et al., 30 Mar 2026). Family-level context from the earlier MiroThinker work further suggests that H1 belongs to a line of agents organized around long context, rich tool use, and interaction scaling, though the v1.0 paper does not itself name H1 and therefore supports that connection only at the level of family continuity rather than direct specification (Team et al., 14 Nov 2025).