
Tongyi DeepResearch: Scaling Agentic AI for Autonomous Research

This presentation explores Tongyi DeepResearch, a scalable agentic language model designed for deep information-seeking tasks. Built on a 30.5B parameter architecture with only 3.3B activated per token, it introduces a unified training paradigm combining agentic mid-training and post-training with fully automated synthetic data pipelines. The system achieves state-of-the-art performance across seven research benchmarks, outperforming larger proprietary models through innovative environment design, reinforcement learning, and context management strategies.

Script

Most research assistants can answer simple questions. But ask them to investigate something truly uncertain, something requiring 50 steps of reasoning across multiple sources, and they collapse. Tongyi DeepResearch changes that with a system that reaches state-of-the-art results on the hardest information-seeking benchmarks while activating only 3.3 billion parameters per token.

Traditional language models learn language, then learn to follow instructions. The authors introduce a third stage: agentic mid-training, where the model learns to think like an agent before it ever sees a single task. This inductive bias, instilled through massive synthetic trajectories, makes the difference between a model that can answer and one that can investigate.

That training requires data no human could afford to label.

The researchers built a fully automated synthesis pipeline that generates millions of research trajectories without a single human annotation. On the left, entity-anchored memories spawn genuinely uncertain questions. On the right, decomposition models and simulated environments create diverse action sequences. The result is training data that would cost millions to label manually, produced at scale.

The reinforcement learning infrastructure operates entirely on-policy, with asynchronous rollout servers handling model inference and tool invocation simultaneously. Each agent interacts with real search engines, simulated databases, and knowledge sources, accumulating experience that feeds directly back into policy updates. The sandboxed architecture handles API failures and rate limits gracefully, making real-world learning practical at scale.
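To make the rollout idea concrete, here is a minimal sketch of asynchronous trajectory collection with retries around a flaky tool. It is an illustration under stated assumptions, not the authors' infrastructure: `stub_policy`, `flaky_search`, `ToolError`, and the retry settings are all hypothetical stand-ins, and the policy update step is left out entirely.

```python
import asyncio
from typing import Awaitable, Callable, List, Set, Tuple

Trajectory = List[Tuple[str, str]]  # (action, observation) pairs


class ToolError(Exception):
    """Hypothetical error for a failed tool call (timeout, rate limit, ...)."""


async def call_with_retry(tool: Callable[[str], Awaitable[str]], query: str,
                          retries: int = 3, base_delay: float = 0.01) -> str:
    """Call a tool, backing off exponentially after each failure."""
    for attempt in range(retries):
        try:
            return await tool(query)
        except ToolError:
            await asyncio.sleep(base_delay * 2 ** attempt)
    # Degrade gracefully instead of crashing the whole rollout.
    return "[tool unavailable]"


async def rollout(policy: Callable[[str], Awaitable[str]],
                  tool: Callable[[str], Awaitable[str]],
                  question: str, max_turns: int = 3) -> Trajectory:
    """Run one agent episode: the policy picks an action, the tool answers."""
    trajectory: Trajectory = []
    context = question
    for _ in range(max_turns):
        action = await policy(context)
        observation = await call_with_retry(tool, action)
        trajectory.append((action, observation))
        context = f"{context}\n{action} -> {observation}"
    return trajectory


async def collect(policy, tool, questions: List[str]) -> List[Trajectory]:
    """Run many episodes concurrently; a slow tool call does not block the rest."""
    return await asyncio.gather(*(rollout(policy, tool, q) for q in questions))


# --- Hypothetical stand-ins for the model server and the search tool ---
_seen: Set[str] = set()


async def stub_policy(context: str) -> str:
    """Pretend model: emits a unique search action per question and turn."""
    return f"search[{context.splitlines()[0]}#{context.count('->')}]"


async def flaky_search(query: str) -> str:
    """Pretend search engine: fails the first time it sees each query."""
    if query not in _seen:
        _seen.add(query)
        raise ToolError("rate limited")
    return f"result for {query}"


if __name__ == "__main__":
    for traj in asyncio.run(collect(stub_policy, flaky_search, ["q1", "q2"])):
        print(traj)
```

In a real on-policy setup, the collected trajectories would be scored and fed into a policy update before the next batch of rollouts; that part is deliberately omitted here.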

Environments are not passive backdrops but active training components. Mid-training uses zero-cost simulated worlds to bootstrap millions of trajectories. Reinforcement learning then deploys into real search engines and live APIs, where the agent must handle latency, errors, and non-stationarity. The Markovian memory trick addresses context overflow: compress earlier history into a running report, then condition only on that report and the most recent steps.
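The running-report idea can be sketched in a few lines. This is a toy illustration, assuming a fixed window of recent steps and a summarizer function; `AgentState`, `step`, `build_context`, and `toy_summarize` are hypothetical names, and in practice the summarizer would be a model call rather than string truncation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentState:
    question: str
    report: str = ""                                  # compressed memory of older steps
    recent: List[str] = field(default_factory=list)   # raw recent observations


def build_context(state: AgentState) -> str:
    """The model sees only the question, the report, and the recent steps."""
    recent = "\n".join(state.recent)
    return f"Question: {state.question}\nReport: {state.report}\nRecent:\n{recent}"


def step(state: AgentState, observation: str,
         summarize: Callable[[str, str], str], window: int = 2) -> AgentState:
    """Add an observation; fold anything older than `window` steps into the report."""
    recent = state.recent + [observation]
    cut = max(len(recent) - window, 0)
    report = state.report
    for old in recent[:cut]:
        report = summarize(report, old)
    return AgentState(state.question, report, recent[cut:])


def toy_summarize(report: str, observation: str) -> str:
    """Stand-in for a model call: keep the first five words of the folded step."""
    snippet = " ".join(observation.split()[:5])
    return f"{report}; {snippet}" if report else snippet


if __name__ == "__main__":
    state = AgentState("Who founded the company?")
    for i in range(5):
        state = step(state, f"observation {i}", toy_summarize)
    print(build_context(state))
```

The point of the design is that the next state depends only on the current state and the new observation, so the context stays bounded however many steps the agent takes.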

The training paradigm delivers measurable advantages.

Tongyi DeepResearch outperforms strong baselines, including the proprietary OpenAI o3 and the open-weight DeepSeek V3.1, despite activating only 3.3 billion parameters per token, far fewer than much larger models such as DeepSeek V3.1. The heavy mode, which runs parallel agents and synthesizes their outputs, pushes performance even higher: 38.3 percent on Humanity's Last Exam, a benchmark designed to resist even the best models. This is not incremental improvement. This is architectural efficiency.
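As a rough picture of the heavy mode, the sketch below runs several agents in parallel on the same question and then merges their outputs. It rests on loud assumptions: `stub_agent` is a hypothetical placeholder for a full research agent, and the merge step here is a simple majority vote over final answers, swapped in for illustration in place of a model that synthesizes the agents' reports.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Report = Dict[str, str]  # {"report": ..., "answer": ...}


def stub_agent(question: str, seed: int) -> Report:
    """Hypothetical agent: most seeds agree on 'A', one dissents with 'B'."""
    answer = "B" if seed % 4 == 0 else "A"
    return {"report": f"agent {seed} notes on {question!r}", "answer": answer}


def synthesize(reports: List[Report]) -> str:
    """Illustrative merge: majority vote over the agents' final answers."""
    votes = Counter(r["answer"] for r in reports)
    return votes.most_common(1)[0][0]


def heavy_mode(question: str, agent: Callable[[str, int], Report],
               n_agents: int = 4) -> str:
    """Run n_agents independent attempts in parallel, then merge them."""
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        reports = list(pool.map(lambda seed: agent(question, seed), range(n_agents)))
    return synthesize(reports)


if __name__ == "__main__":
    print(heavy_mode("Which year was the treaty signed?", stub_agent))
```

Spending more compute at inference time this way trades cost for accuracy: each agent explores its own path, and the merge step only has to read short reports rather than every raw trajectory.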

The experiments reveal two scaling trends. First, context length matters: 128K-token models outperform 32K-token models, but shorter contexts teach the agent to compress and prioritize. Second, interaction depth scales performance linearly on research tasks. More search turns, more reasoning steps, higher accuracy. The reinforcement learning curves are textbook stable: no collapse, no plateaus, just steady improvement as the agent learns to navigate uncertainty.

Tongyi DeepResearch shows that agentic intelligence does not require frontier-scale models. With the right training paradigm, synthetic data at scale, and environments that teach rather than evaluate, an agent that activates only 3.3 billion of its 30.5 billion parameters per token can outresearch far larger systems. You can explore this paper in depth and create your own research video at EmergentMind.com.