MiroThinker v1.0 Research Agent
- MiroThinker v1.0 is an open-source research agent that integrates model size, context length, and interaction depth to advance tool-augmented reasoning.
- It alternates between LLM-driven thought steps and structured tool actions, using reinforcement learning to refine and correct its multi-step decision process.
- The 72B-parameter variant achieves state-of-the-art benchmark results, underscoring the significance of interaction scaling for adaptive, real-world problem-solving.
MiroThinker v1.0 is an open-source research agent engineered to advance tool-augmented reasoning and information-seeking by integrating model size, context length, and—distinctively—interaction depth as fundamental axes of scaling. Building on the ReAct agent paradigm, MiroThinker alternates between LLM-driven “thought” steps and structured tool-based actions (e.g., search, code execution, web scraping), incorporating real-time environment feedback within its decision loop. Through reinforcement learning (RL), the agent is systematically trained to operate across substantially deeper agent–environment interaction trajectories than prior agents, achieving performance on par with proprietary research assistants. The 72B-parameter MiroThinker demonstrates state-of-the-art open-source performance across a suite of real-world research benchmarks, establishing interaction scaling as a third critical dimension for next-generation research agents (Team et al., 14 Nov 2025).
1. System Architecture and Scaling Dimensions
MiroThinker v1.0 is designed according to the ReAct framework, comprising two alternating modules: an LLM “thought” module and an action policy issuing tool calls. Each observation resulting from a tool invocation is fed back into the agent’s context, contributing to subsequent reasoning steps until the agent decides on a final answer. This architecture is illustrated in Figure 1 (“Overview of the MiroThinker v1.0 agent architecture”) in the original work.
The agent is released in three parameter scales—8B, 30B, and 72B—permitting controlled exploration of the relationship between model size and downstream accuracy. Contextual information is accommodated within a 256K-token sliding window using recency-based retention (keeping only the most recent tool observations within a fixed budget) and truncation of lengthy tool outputs, enabling the agent to perform up to 600 tool calls per task.
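The core loop can be summarized in a short sketch. The Python below is illustrative only: the callables (`llm_generate`, `execute_tool`), the record format, and the observation budget of 20 are assumptions, not the released implementation; only the 600-call budget and the recency-based retention strategy come from the description above.

```python
def run_agent(task, llm_generate, execute_tool,
              max_tool_calls=600, observation_budget=20):
    """Minimal ReAct-style loop with recency-based retention of tool observations.
    llm_generate(task, context) -> {"role": "assistant", "action": ..., "arguments": ..., "content": ...}
    execute_tool(name, arguments) -> observation string
    """
    trajectory = []  # interleaved thought/action/observation records, in order
    for _ in range(max_tool_calls):
        # Recency-based retention: keep all thoughts/actions, but only the most
        # recent tool observations, so long tasks fit the bounded context window.
        obs_indices = [i for i, r in enumerate(trajectory) if r["role"] == "observation"]
        dropped = set(obs_indices[:-observation_budget])
        context = [r for i, r in enumerate(trajectory) if i not in dropped]

        step = llm_generate(task, context)   # one thought plus one structured action
        trajectory.append(step)
        if step["action"] == "final_answer":
            return step["content"]

        obs = execute_tool(step["action"], step.get("arguments", {}))  # e.g. google_search
        trajectory.append({"role": "observation", "content": obs})
    return None  # interaction budget exhausted without a final answer
```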
Notably, MiroThinker identifies interaction depth—the number of thought–action–observation turns taken on a task—as a synergistic scaling axis. Scaling the three dimensions jointly shows that larger models exploit longer contexts and more complex interaction histories, extended contexts permit deeper trajectories, and greater interaction depth yields richer error correction and evidence collection.
2. Interactive Scaling Methodology
Rather than permitting arbitrary chain depth only at inference (test-time scaling), MiroThinker is explicitly trained to manage long, corrective reasoning chains through a reinforcement learning regime. Test-time scaling in frozen LLMs can degrade with long trajectories (e.g., error propagation, hallucinations), but RL tuning guides the model to harness environment feedback and external information for robust, multi-step interaction.
Reinforcement Learning Protocol:
- State: The recency-filtered agent history comprises all previous thoughts/actions and the most recent tool observations.
- Action: At each step $t$, the policy $\pi_\theta$ either invokes a structured tool call $a_t$ (e.g., `google_search(query)`, `run_python_code(script)`) or terminates and emits the final answer.
- Reward: For a trajectory $\tau$ generated on input $x$,
$$R(\tau) = w_{\mathrm{acc}}\, r_{\mathrm{acc}}(\tau) - w_{\mathrm{fmt}}\, r_{\mathrm{fmt}}(\tau),$$
where $r_{\mathrm{acc}}$ is an LLM-graded correctness indicator and $r_{\mathrm{fmt}}$ penalizes output-format violations; the weights $w_{\mathrm{acc}}$ and $w_{\mathrm{fmt}}$ are tuned to balance exploration and compliance.
- Optimization: Using Group Relative Policy Optimization (GRPO), the model samples a group of $G$ trajectories $\{\tau_i\}_{i=1}^{G}$ for each query, computes group-mean-centered advantages $A_i = R(\tau_i) - \tfrac{1}{G}\sum_{j=1}^{G} R(\tau_j)$, and updates model parameters to maximize the clipped surrogate objective
$$J(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\big(\rho_i A_i,\; \mathrm{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\, A_i\big)\right] - \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big), \qquad \rho_i = \frac{\pi_\theta(\tau_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(\tau_i \mid x)}.$$
During rollout, each tool call is executed in a Linux sandbox; the resulting observation is appended to the trajectory and, if within the recency budget, incorporated into the agent state for subsequent actions. Completed trajectories are graded by an LLM, and trivial failures (e.g., aborted or malformed rollouts) are filtered out prior to reward computation and the policy update.
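As a concrete illustration of the group-relative update, the sketch below computes trajectory rewards and group-mean-centered advantages; the reward weights and function names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def trajectory_reward(is_correct: bool, format_violations: int,
                      w_acc: float = 1.0, w_fmt: float = 0.1) -> float:
    """Scalar reward for one trajectory: LLM-graded correctness minus a
    format penalty. The weights here are placeholders."""
    return w_acc * float(is_correct) - w_fmt * format_violations

def group_relative_advantages(rewards: list[float]) -> np.ndarray:
    """GRPO-style advantages: center each trajectory's reward on the mean
    reward of the group sampled for the same query."""
    r = np.asarray(rewards, dtype=np.float64)
    return r - r.mean()

# Example: a group of G = 4 sampled trajectories for one query.
rewards = [trajectory_reward(True, 0), trajectory_reward(False, 1),
           trajectory_reward(False, 0), trajectory_reward(True, 2)]
advantages = group_relative_advantages(rewards)  # positive => better than group mean
```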
3. Data Sources, Tool Suite, and Training Protocol
Model Variants and Context:
- Scale: 8B, 30B, and 72B parameters.
- Context: 256K-token window, with recency-based retention of recent tool observations and truncation of lengthy tool outputs.
Integrated Tools and Execution Environment:
- Linux sandbox with `run_command` and `run_python_code`.
- File I/O via `upload_file*` and `download_file*`.
- Information retrieval through `google_search`, with LLM-driven content extraction via `scrape_and_extract_info`.
- All tools operate in a security sandbox; access to HuggingFace is disabled to prevent benchmark leakage.
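A minimal dispatcher over such a tool suite might look as follows; the handler signatures and the registry are assumptions for illustration, not the released tool interface, and the search/scraping handlers are omitted because they depend on external services.

```python
import subprocess

def run_command(command: str, timeout: int = 60) -> str:
    """Execute a shell command inside the (assumed) sandbox and return its output."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout + result.stderr

def run_python_code(script: str, timeout: int = 60) -> str:
    """Run a Python snippet in a subprocess and return combined stdout/stderr."""
    result = subprocess.run(["python", "-c", script], capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout + result.stderr

# Registry mapping structured tool names to handlers.
TOOL_REGISTRY = {
    "run_command": run_command,
    "run_python_code": run_python_code,
}

def dispatch(tool_name: str, arguments: dict) -> str:
    """Route a structured tool call emitted by the agent to the matching handler."""
    handler = TOOL_REGISTRY.get(tool_name)
    if handler is None:
        return f"Error: unknown tool '{tool_name}'"
    return handler(**arguments)
```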
Training Dataset Sources:
- Synthetic multi-document QA: built by hyperlink mapping from Wikipedia/Common Crawl, fact extraction, constraint obfuscation, and LLM-generated multi-hop queries (sketched after this list).
- Agentic trajectories synthesized under ReAct (single-agent) and MiroFlow (multi-agent) paradigms, spanning diverse LLMs (GPT-OSS, DeepSeek-V3.1) and employing both Function Calling and the Model Context Protocol (MCP).
- Supplementary datasets: MusiQue, HotpotQA, WebWalkerQA-Silver, MegaScience, TaskCraft, WebShaper, WebDancer, Toucan-1.5M, and others converted for tool-augmented use.
- Post-training on distilled chain-of-thought and conversational corpora to maintain dialogue competence.
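The multi-document QA synthesis described above can be sketched as a pipeline; every function name below, and the use of callables for the LLM and the corpus fetcher, is an assumption made for illustration.

```python
from typing import Callable

def synthesize_multihop_example(seed_title: str,
                                fetch_page: Callable[[str], dict],
                                llm: Callable[[str], str],
                                hops: int = 3) -> dict:
    """Sketch of multi-document QA synthesis: follow hyperlinks from a seed page,
    extract one fact per document along the chain, then have an LLM compose a
    multi-hop question whose constraints are obfuscated."""
    facts, title = [], seed_title
    for _ in range(hops):
        page = fetch_page(title)  # expected shape: {"text": ..., "links": [...]}
        facts.append(llm(f"Extract one salient fact from:\n{page['text']}"))
        if not page["links"]:
            break
        title = page["links"][0]  # hop to a linked document

    question = llm(
        "Write a single question that requires combining all of these facts, "
        "rephrasing names and dates indirectly so no single fact gives it away:\n"
        + "\n".join(facts)
    )
    answer = llm("Answer concisely using only these facts:\n" + "\n".join(facts))
    return {"question": question, "answer": answer, "supporting_facts": facts}
```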
RL Hyperparameters:
- GRPO group size $G$
- KL penalty coefficient $\beta$
- Reward weights $w_{\mathrm{acc}}$, $w_{\mathrm{fmt}}$ (tuned)
- Streaming rollouts with a task queue; aborted or retried trajectories are filtered out.
4. Benchmark Evaluation and Scaling Results
MiroThinker v1.0 was evaluated on four real-world, tool-augmented benchmarks: GAIA, Humanity’s Last Exam (HLE), BrowseComp, and BrowseComp-ZH.
Performance table (accuracy, %):
| Model | HLE | BrowseComp | BrowseComp-ZH | GAIA |
|---|---|---|---|---|
| MiniMax-M2 | 31.8 | 44.0 | 48.5 | 75.7 |
| GPT-5-high | 35.2 | 54.9 | 65.0 | 76.4 |
| MiroThinker-8B | 21.5 | 31.1 | 40.2 | 66.4 |
| MiroThinker-30B | 33.4 | 41.2 | 47.8 | 73.5 |
| MiroThinker-72B | 37.7 | 47.1 | 55.6 | 81.9 |
The 72B model achieves 81.9% on GAIA, 37.7% on HLE, 47.1% on BrowseComp, and 55.6% on BrowseComp-ZH, outperforming the strongest open-source baselines by 4–6 percentage points and approaching results from commercial agents such as GPT-5-high.
Interaction Depth Scaling:
Empirical analysis shows that RL-tuned agents increase the mean tool-call depth per query by roughly 2× or more, conferring an $8$–$10$ percentage point gain in accuracy. Across benchmarks, accuracy as a function of interaction depth $N$ fits a logarithmic relationship of the form
$$\mathrm{Acc}(N) \approx a \log N + b,$$
with the fitted curve and its $R^2$ for BrowseComp-ZH detailed in Figure 2 (“Illustration of interactive scaling”). This result substantiates interaction depth as a third axis of scale.
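For reference, such a logarithmic fit can be estimated by ordinary least squares on $\log N$; the sketch below is illustrative, with measured (depth, accuracy) pairs to be supplied from benchmark runs rather than hardcoded here.

```python
import numpy as np

def fit_log_scaling(depths, accuracies):
    """Least-squares fit of Acc(N) ≈ a * log(N) + b.

    depths     -- mean interaction depths N per configuration
    accuracies -- corresponding benchmark accuracies (%)
    Returns the fitted coefficients (a, b)."""
    x = np.log(np.asarray(depths, dtype=np.float64))
    y = np.asarray(accuracies, dtype=np.float64)
    a, b = np.polyfit(x, y, deg=1)
    return a, b

def predict_accuracy(n_calls, a, b):
    """Predicted accuracy at interaction depth n_calls under the fitted law."""
    return a * np.log(n_calls) + b
```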
5. Discussion, Limitations, and Implications
Interactive scaling enables the agent to iteratively refine hypotheses, correct prior errors, and accumulate evidence from the external environment, supplementing the effect of model size and context length. Deeper interaction trajectories functionally extend the working memory of the model, making complex, long-horizon workflows feasible within a bounded context window by leveraging recency-based retention.
RL-trained policies balance exploration (enumerating new tool calls to seek missing evidence) and exploitation (terminating and outputting an answer). However, several limitations are noted:
- The RL agent may generate redundant or marginally useful tool calls. More nuanced reward shaping for “useful” calls is a direction for improvement.
- Reinforcement learning can induce verbose chains-of-thought; integrating distillation or length penalties may compress long trajectories.
- Occasional mixing of languages in non-English queries indicates that stronger multilingual alignment is necessary.
- Errors in managing sandbox IDs and inefficient code generation suggest that additional fine-tuning on code execution tasks may be beneficial.
By establishing interaction scaling as an essential, reproducible axis for agent performance, MiroThinker v1.0 underpins the development of open research agents proficient at iterative, adaptive, and tool-augmented problem-solving in real-world environments (Team et al., 14 Nov 2025).