
MiroThinker v1.0 Research Agent

Updated 18 November 2025
  • MiroThinker v1.0 is an open-source research agent that integrates model size, context length, and interaction depth to advance tool-augmented reasoning.
  • It alternates between LLM-driven thought steps and structured tool actions, using reinforcement learning to refine and correct its multi-step decision process.
  • The 72B-parameter variant achieves state-of-the-art benchmark results, underscoring the significance of interaction scaling for adaptive, real-world problem-solving.

MiroThinker v1.0 is an open-source research agent engineered to advance tool-augmented reasoning and information-seeking by integrating model size, context length, and—distinctively—interaction depth as fundamental axes of scaling. Building on the ReAct agent paradigm, MiroThinker alternates between LLM-driven “thought” steps and structured tool-based actions (e.g., search, code execution, web scraping), incorporating real-time environment feedback within its decision loop. Through reinforcement learning (RL), the agent is systematically trained to operate across substantially deeper agent–environment interaction trajectories than prior agents, achieving performance on par with proprietary research assistants. The 72B-parameter MiroThinker demonstrates state-of-the-art open-source performance across a suite of real-world research benchmarks, establishing interaction scaling as a third critical dimension for next-generation research agents (Team et al., 14 Nov 2025).

1. System Architecture and Scaling Dimensions

MiroThinker v1.0 is designed according to the ReAct framework, comprising two alternating modules: an LLM “thought” module and an action policy issuing tool calls. Each observation resulting from a tool invocation is fed back into the agent’s context, contributing to subsequent reasoning steps until the agent decides on a final answer. This architecture is illustrated in Figure 1 (“Overview of the MiroThinker v1.0 agent architecture”) in the original work.
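To make the loop concrete, here is a minimal Python sketch of the alternation between thought steps and tool actions; the callables llm_think and execute_tool and the termination convention (returning no action) are illustrative assumptions, not the released implementation.

```python
# Minimal ReAct-style loop: alternate LLM "thoughts" with structured tool actions,
# feeding each observation back into the context until the agent emits a final answer.
# llm_think() and execute_tool() are hypothetical stand-ins for the real components.

def react_loop(task: str, llm_think, execute_tool, max_steps: int = 600):
    history = [{"role": "task", "content": task}]
    for _ in range(max_steps):
        thought, action = llm_think(history)      # reasoning step + proposed tool call
        history.append({"role": "thought", "content": thought})
        if action is None:                        # policy chooses to terminate
            return thought                        # treat the final thought as the answer
        observation = execute_tool(action)        # e.g. google_search, run_python_code
        history.append({"role": "observation", "content": observation})
    return "No answer within the interaction budget."
```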

The agent is released in three parameter scales (8B, 30B, and 72B), permitting controlled exploration of the relationship between model size and downstream accuracy. Contextual information is accommodated within a 256K-token sliding window using recency-based retention (a budget of $K = 5$ recent tool observations) and output truncation strategies, enabling the agent to perform up to 600 tool calls per task.
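The recency-based retention just described can be sketched as follows, assuming the history representation from the loop above; the truncation limit and field names are illustrative.

```python
K_RECENT = 5            # recency budget for tool observations (K = 5 in the paper)
MAX_OBS_CHARS = 4000    # illustrative truncation limit for long tool outputs

def build_context(history, k=K_RECENT, max_obs_chars=MAX_OBS_CHARS):
    """Keep all thoughts/actions, but only the k most recent tool observations."""
    obs_indices = [i for i, turn in enumerate(history) if turn["role"] == "observation"]
    keep = set(obs_indices[-k:])                  # indices of observations to retain
    context = []
    for i, turn in enumerate(history):
        if turn["role"] == "observation" and i not in keep:
            continue                              # drop stale observations
        content = turn["content"]
        if turn["role"] == "observation" and len(content) > max_obs_chars:
            content = content[:max_obs_chars] + " …[truncated]"
        context.append({**turn, "content": content})
    return context
```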

Notably, MiroThinker identifies interaction depth (the number of thought–action–observation turns on a task) as a synergistic scaling axis. Scaling the three dimensions jointly is mutually reinforcing: larger models leverage longer contexts and more complex interaction histories, extended contexts permit deeper trajectories, and greater interaction depth yields richer error correction and evidence collection.

2. Interactive Scaling Methodology

Rather than permitting arbitrary chain depth only at inference (test-time scaling), MiroThinker is explicitly trained to manage long, corrective reasoning chains through a reinforcement learning regime. Test-time scaling in frozen LLMs can degrade with long trajectories (e.g., error propagation, hallucinations), but RL tuning guides the model to harness environment feedback and external information for robust, multi-step interaction.

Reinforcement Learning Protocol:

  • State: The recency-filtered agent history $\widehat H_t$ comprises all previous thoughts and actions plus the $K$ most recent tool observations.
  • Action: At each step, $A_t = \pi_\theta(\widehat H_t, T_t)$, where the policy may invoke a structured tool (e.g., google_search(query), run_python_code(script)) or choose to terminate ($\emptyset$).
  • Reward: For a trajectory $H = \{(T_t, A_t, O_t)\}_{t=1}^T$ on input $x$,

$$R(x,H) = \alpha_c\, R_{\mathrm{correct}}(H) - \alpha_f\, R_{\mathrm{format}}(H)$$

where $R_{\mathrm{correct}}$ is an LLM-graded correctness indicator and $R_{\mathrm{format}}$ penalizes output-format violations. The weights $\alpha_c, \alpha_f$ are tuned to balance exploration and format compliance.

For each group of $G$ trajectories sampled from the same input, GRPO computes the group-relative advantage

$$\hat A_i = R(x,H_i) - \frac{1}{G}\sum_{j=1}^{G} R(x,H_j)$$

and updates model parameters to maximize

$$J(\theta) = \mathbb{E}_x\, \mathbb{E}_{H \sim \pi_\theta(\cdot \mid x)} \Bigl[ \hat{A}(H)\,\log \pi_\theta(H \mid x) - \beta_{\mathrm{KL}}\, D_{\mathrm{KL}}\bigl(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr) \Bigr].$$
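As a schematic of this objective, assuming sequence-level log-probabilities and rewards have already been computed for a group of trajectories (a sketch under those assumptions, not the training code):

```python
import torch

def grpo_loss(logprobs, ref_logprobs, rewards, beta_kl=0.01):
    """Group-relative policy objective for G trajectories sampled from one input x.

    logprobs:     tensor [G], log pi_theta(H_i | x) for each sampled trajectory
    ref_logprobs: tensor [G], log pi_ref(H_i | x) under a frozen reference policy
    rewards:      tensor [G], R(x, H_i) = alpha_c * R_correct - alpha_f * R_format
    """
    advantages = rewards - rewards.mean()     # group-relative advantage \hat A_i
    kl = logprobs - ref_logprobs              # simple per-trajectory log-ratio KL estimate
    # Negate because optimizers minimize; the objective J(theta) is maximized.
    return -(advantages.detach() * logprobs - beta_kl * kl).mean()
```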

During rollout, each tool call is executed in a Linux sandbox; the resulting observation is appended to the trajectory and, if within the recency budget, incorporated into the agent state for subsequent actions. Completed trajectories are graded by an LLM, and trivial failures (e.g., aborted rollouts) are filtered out before reward computation and the policy update.
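A small sketch of this filtering and grading step, assuming a hypothetical LLM judge (grade_with_llm) and a simple flag for aborted rollouts:

```python
def filter_and_score(trajectories, grade_with_llm, alpha_c=1.0, alpha_f=0.5):
    """Drop aborted/trivial rollouts, then compute R(x, H) for the survivors."""
    scored = []
    for traj in trajectories:
        if traj.get("aborted") or len(traj["steps"]) == 0:
            continue                               # trivial failure: excluded from the update
        correct = grade_with_llm(traj)             # LLM-graded correctness in {0, 1}
        format_violation = float(not traj.get("well_formatted", True))
        reward = alpha_c * correct - alpha_f * format_violation
        scored.append((traj, reward))
    return scored
```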

3. Data Sources, Tool Suite, and Training Protocol

Model Variants and Context:

  • Scale: 8B, 30B, and 72B parameters.
  • Context: 256K-token window, $K = 5$ recent tool observations retained, and output truncation applied to lengthy tool outputs.

Integrated Tools and Execution Environment:

  • Linux sandbox with run_command, run_python_code.
  • File I/O via upload_file* and download_file*.
  • Information retrieval through google_search, and LLM-driven content extraction with scrape_and_extract_info.
  • All tools operate in a security sandbox; access to HuggingFace is disabled to prevent benchmark leakage.
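The tool suite above can be pictured as a thin dispatch layer over the sandbox; the registry below is a hedged sketch in which the sandbox methods and argument names are assumptions for illustration.

```python
# Hypothetical dispatch layer: maps structured tool calls emitted by the policy
# to sandboxed handlers. Handler signatures and argument names are assumptions.
TOOL_REGISTRY = {
    "run_command":             lambda args, sandbox: sandbox.run_command(args["command"]),
    "run_python_code":         lambda args, sandbox: sandbox.run_python(args["script"]),
    "google_search":           lambda args, sandbox: sandbox.search(args["query"]),
    "scrape_and_extract_info": lambda args, sandbox: sandbox.scrape(args["url"], args["goal"]),
}

BLOCKED_DOMAINS = {"huggingface.co"}   # disabled to prevent benchmark leakage

def dispatch(tool_call, sandbox):
    """Execute one structured tool call and return its observation."""
    name, args = tool_call["name"], tool_call["arguments"]
    if name not in TOOL_REGISTRY:
        return f"Error: unknown tool '{name}'"
    if any(domain in str(args) for domain in BLOCKED_DOMAINS):
        return "Error: access to this domain is disabled."
    return TOOL_REGISTRY[name](args, sandbox)
```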

Training Dataset Sources:

  • Synthetic multi-document QA: built by hyperlink mapping from Wikipedia/Common Crawl, fact extraction, constraint obfuscation, and LLM-generated multi-hop queries.
  • Agentic trajectories synthesized under ReAct (single-agent) and MiroFlow (multi-agent) paradigms, spanning diverse LLMs (GPT-OSS, DeepSeek-V3.1), and employing both Function Calling and Model Context Protocols.
  • Supplementary datasets: MusiQue, HotpotQA, WebWalkerQA-Silver, MegaScience, TaskCraft, WebShaper, WebDancer, Toucan-1.5M, and others converted for tool-augmented use.
  • Post-training on distilled chain-of-thought and conversational corpora to maintain dialogue competence.
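The synthetic multi-document QA construction can be pictured roughly as a hyperlink-walking pipeline; every helper below (follow_link, extract_facts, obfuscate_constraint, compose_multihop_question) is a hypothetical LLM-backed component, so this is a schematic of the described steps rather than the authors' generation code.

```python
def synthesize_multihop_qa(seed_page, follow_link, extract_facts,
                           obfuscate_constraint, compose_multihop_question, hops=3):
    """Walk hyperlinks from a seed document, collect facts, and compose a multi-hop query."""
    facts, page = [], seed_page
    for _ in range(hops):
        facts.extend(extract_facts(page))         # pull atomic facts from the current document
        page = follow_link(page)                  # hop to a hyperlinked document
    # Replace direct mentions with indirect constraints so answering requires multi-step search.
    constraints = [obfuscate_constraint(f) for f in facts]
    question, answer = compose_multihop_question(constraints)
    return {"question": question, "answer": answer, "supporting_facts": facts}
```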

RL Hyperparameters:

  • GRPO group size $G \approx 16$
  • KL penalty $\beta_{\mathrm{KL}} \approx 0.01$
  • Reward weights $\alpha_c = 1.0$, $\alpha_f = 0.5$ (tuned)
  • Streaming rollouts with task queue; abort/retry trajectories are filtered.

4. Benchmark Evaluation and Scaling Results

MiroThinker v1.0 was evaluated on four real-world, tool-augmented benchmarks: GAIA, Humanity’s Last Exam (HLE), BrowseComp, and BrowseComp-ZH.

Performance Table (accuracy, %):

Model            HLE    BrowseComp   BrowseComp-ZH   GAIA
MiniMax-M2       31.8   44.0         48.5            75.7
GPT-5-high       35.2   54.9         65.0            76.4
MiroThinker-8B   21.5   31.1         40.2            66.4
MiroThinker-30B  33.4   41.2         47.8            73.5
MiroThinker-72B  37.7   47.1         55.6            81.9

The 72B model achieves 81.9% on GAIA, 37.7% on HLE, 47.1% on BrowseComp, and 55.6% on BrowseComp-ZH, outperforming the strongest open-source baselines by 4–6 percentage points and approaching results from commercial agents such as GPT-5-high.

Interaction Depth Scaling:

Empirical analysis shows that RL-tuned agents increase the mean tool-call depth $D$ per query by approximately 2–3×, conferring an 8–10 percentage point gain in accuracy. Across benchmarks, accuracy as a function of interaction depth fits a logarithmic relationship:

$$\mathrm{Acc}(D) \approx a\,\log D + b$$

with $a \approx 0.12$, $b \approx 24.8$ for BrowseComp-ZH, as detailed in Figure 2 (“Illustration of interactive scaling”). This result substantiates interaction depth as a third axis of scale.
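For reference, fitting such a logarithmic curve reduces to linear least squares on $\log D$; the depth and accuracy values in the sketch below are placeholders, not the paper's measurements.

```python
import numpy as np

# Placeholder (interaction depth, accuracy %) pairs; the real curve comes from the
# paper's benchmark sweeps. Fitting Acc(D) ≈ a*log(D) + b is a linear least-squares fit.
depths = np.array([10, 30, 100, 300, 600], dtype=float)
accuracy = np.array([30.0, 38.0, 44.0, 50.0, 54.0])   # illustrative values only

a, b = np.polyfit(np.log(depths), accuracy, deg=1)     # slope a, intercept b
print(f"Acc(D) ≈ {a:.2f} * log D + {b:.2f}")
```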

5. Discussion, Limitations, and Implications

Interactive scaling enables the agent to iteratively refine hypotheses, correct prior errors, and accumulate evidence from the external environment, supplementing the effect of model size and context length. Deeper interaction trajectories functionally extend the working memory of the model, making complex, long-horizon workflows feasible within a bounded context window by leveraging recency-based retention.

RL-trained policies balance exploration (enumerating new tool calls to seek missing evidence) and exploitation (terminating and outputting an answer). However, several limitations are noted:

  • The RL agent may generate redundant or marginally useful tool calls. More nuanced reward shaping for “useful” calls is a direction for improvement.
  • Reinforcement learning can induce verbose chains-of-thought; integrating distillation or length penalties may compress long trajectories.
  • Occasional mixing of languages in non-English queries indicates that stronger multilingual alignment is necessary.
  • Errors in managing sandbox IDs and inefficient code generation suggest that additional fine-tuning on code execution tasks may be beneficial.

By establishing interaction scaling as an essential, reproducible axis for agent performance, MiroThinker v1.0 underpins the development of open research agents proficient at iterative, adaptive, and tool-augmented problem-solving in real-world environments (Team et al., 14 Nov 2025).
