Agent-Native Mid-Training Paradigm
- Agent-Native Mid-Training Paradigm is a training approach that internalizes agentic behaviors within model parameters using authentic action-observation feedback sequences.
- It leverages methods such as trajectory-centric modeling, reinforcement learning, self-reflection, and multi-agent coordination to optimize agent workflows.
- Empirical results are strong: for example, a 58.5% resolution rate on SWE-Bench Verified and a +14.4% average performance lift across agentic benchmarks, underscoring the paradigm's scalability and practical effectiveness.
Agent-Native Mid-Training Paradigm refers to a spectrum of training algorithms and data regimes for instilling agentic behaviors—such as planning, tool-use, reflection, or multi-agent coordination—directly within model or agent parameters, at an intermediate stage between pretraining and downstream post-training. Unlike pipeline-based systems, where agentic capabilities are modularized and orchestrated by external scripts or policies, agent-native mid-training emphasizes the internalization and direct optimization of agent workflows, feedback loops, and interactive environment signals, either within fixed model weights or via structured adaptive context. This paradigm encompasses a wide array of methodologies: trajectory-centric next-token modeling on agentic data, in-place reinforcement learning from agent execution, meta tool learning and self-reflection without weight updates, multi-agent mid-training with interaction-centric objectives, and function-library adaptation with model weights frozen. Across domains—software engineering, STEM, multimodal reasoning, tool-based research, and multi-agent environments—agent-native mid-training marks a methodological pivot towards agents that learn, adapt, and specialize through authentic, dynamic experience rather than static demonstrations or scripted scaffolds.
1. Defining Agent-Native Mid-Training: Foundations and Motivation
Agent-native mid-training (ANMT) is the intermediate training regime that bridges large-scale unsupervised pretraining and data-intensive supervised or reinforcement learning (RL) post-training. Its key distinction is the use of agent-native data: supervision that encodes the full action-observation-feedback sequences native to autonomous agents operating in authentic environments. In formal terms, given a policy $\pi_\theta$, observations $o_t$, actions $a_t$, and an agent-native trajectory corpus $\mathcal{D}$, the objective is:

$$\max_\theta \; \mathbb{E}_{\tau \sim \mathcal{D}} \Big[ \textstyle\sum_{t} \log \pi_\theta(a_t \mid o_{\le t}, a_{<t}) \Big],$$

while ensuring that each sample encodes multi-step sequences as encountered by a deployed agent (Zeng et al., 26 Jan 2026).
ANMT is motivated by the observed inefficiencies and domain gaps of both static pretraining—which lacks dynamic, interactive state-action feedback—and pure RL, which is computationally prohibitive and constrained by the base model’s representational limits. By leveraging supervised learning on authentic agentic rollouts at scale, ANMT offers a scalable, data-efficient, and task-general paradigm for injecting agentic priors and behaviors (Zeng et al., 26 Jan 2026, Lu et al., 31 Dec 2025).
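The objective above, with segment masking restricted to agent-output tokens, can be sketched as a masked cross-entropy loss. A minimal NumPy sketch (function and variable names are our own, not taken from the cited papers):

```python
import numpy as np

def masked_next_token_loss(logits, targets, agent_mask):
    """Average cross-entropy over next-token predictions, counted only on
    agent-output tokens (agent_mask == 1); observation/environment tokens
    are masked out, as in segment-masked trajectory modeling.
    Shapes: logits (T, V), targets (T,), agent_mask (T,)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]   # per-token NLL
    # average only over agent tokens, ignoring masked positions
    return (token_nll * agent_mask).sum() / agent_mask.sum()
```

With uniform logits the loss equals $\log V$ over the unmasked positions; as the model concentrates probability on the agent's actual actions, the loss approaches zero.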
2. Categories of Agent-Native Data and Trajectories
Across ANMT implementations, the foundation lies in the careful construction or synthesis of agent-native data, which falls broadly into the following categories:
| Type | Description | Example Domains |
|---|---|---|
| Contextually-native | Full action-observation history, navigation context, edit traces | Code PR workflows (Zeng et al., 26 Jan 2026) |
| Environmentally-native | Real environment interactive feedback (tool output, errors) | Executable code edits (Zeng et al., 26 Jan 2026) |
| Agentic–Chain-of-Thought | Structured multi-phase (plan, act, reflect) generator trajectories | STEM, math, research (Lu et al., 31 Dec 2025) |
| Multi-agent interactive | Self-play or multi-agent dialogs; role-specific decision logs | Coordination, theory-of-mind (Hu et al., 9 Dec 2025) |
| Task-oriented machine tokens | LLM-learned embedding-based message trajectories | Agent communication (Xiao et al., 29 Jul 2025) |
| Self-Reflection & Meta-Tool | Episodic reflection, tool-use logs, context augmentation | Knowledge agents (Qian et al., 1 Aug 2025) |
| Offline function editing | Incremental library synthesis/edit traces with failure feedback | Symbolic reasoning, QA (Zhang et al., 2024) |
For example, daVinci-Dev synthesizes both contextually-native (from pull requests and repo navigation) and environmentally-native (from in-Docker agent execution and tool feedback) code trajectories, enabling large-scale exposure to authentic agentic feedback loops (Zeng et al., 26 Jan 2026). In Youtu-LLM, agentic mid-training leverages over 200B tokens of structured math, code, research, and tool-use trajectories, each annotated with explicit phase tags (<Analysis>, <Plan>, <Action>, etc.), branching at critical action or failure points (Lu et al., 31 Dec 2025).
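The phase-tagged trajectory structure described above can be sketched as a simple data model. The tag set mirrors the `<Analysis>`/`<Plan>`/`<Action>` annotations reported for Youtu-LLM, but the class layout and serialization are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    phase: str             # e.g. "Analysis", "Plan", "Action", "Reflect"
    content: str           # model-generated text for this phase
    observation: str = ""  # environment feedback (tool output, errors), if any

@dataclass
class AgentTrajectory:
    task: str
    steps: list = field(default_factory=list)

    def to_training_text(self) -> str:
        """Serialize the trajectory with explicit phase tags so that
        mid-training sees interleaved agent output and environment feedback."""
        parts = [self.task]
        for s in self.steps:
            parts.append(f"<{s.phase}>{s.content}</{s.phase}>")
            if s.observation:
                parts.append(f"<Observation>{s.observation}</Observation>")
        return "\n".join(parts)
```

In a real pipeline the `<Observation>` spans would be masked out of the loss, while the phase-tagged agent spans are supervised.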
3. Training Algorithms and Optimization Objectives
The ANMT paradigm encompasses varying learning modalities and objectives, including:
- Trajectory-centric next-token modeling: Standard cross-entropy loss on assistant- or agent-output tokens in agentic trajectories, often with segment masking to focus learning on agent behaviors (Lu et al., 31 Dec 2025, Zeng et al., 26 Jan 2026).
- Reinforcement learning from native execution (Agent Lightning): Treats agent executions as Markov Decision Processes (MDPs), logging tuples at runtime and applying hierarchical RL (e.g., GRPO) on transitions without altering agent orchestration logic (Luo et al., 5 Aug 2025).
- Meta tool learning and context synthesis (MetaAgent): Incorporates self-reflection and verified reflection into dynamic context banks and in-house knowledge bases to shape agent behavior without parameter updates (Qian et al., 1 Aug 2025).
- Early experience and self-reflection (Agent Learning via Early Experience): Generates agent rollouts from current policies, applying auxiliary objectives for implicit world modeling and natural language rationale generation to improve policy grounding and reasoning (Zhang et al., 9 Oct 2025).
- Function library optimization with frozen models: Treats the function set as the “agent parameters”, using an LLM-based optimizer to incrementally edit the library with roll-back and early-stop mechanisms, while the core model weights remain unchanged (Zhang et al., 2024).
- Multi-agent losses (native multi-agent mid-training): Combines understanding (theory-of-mind), joint planning, communication efficiency, and adaptation, each with specific loss functions, interleaved within minibatches according to a multi-task mixture (Hu et al., 9 Dec 2025).
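The Agent Lightning framing, logging (state, action, reward) tuples at runtime without touching agent orchestration logic, can be illustrated by a thin wrapper around the agent's LLM-call function. The names and the broadcast credit assignment below are our own simplifications, not the framework's actual API:

```python
from dataclasses import dataclass

@dataclass
class Transition:
    state: str     # serialized agent context at the time of the LLM call
    action: str    # the model's generated output
    reward: float  # assigned after execution finishes

class TransitionLogger:
    """Wraps an agent's LLM-call function so each call is recorded as an
    MDP transition for a downstream RL trainer (e.g. GRPO), leaving the
    agent's own orchestration code unchanged."""
    def __init__(self, llm_call):
        self.llm_call = llm_call
        self.buffer = []

    def __call__(self, state: str) -> str:
        action = self.llm_call(state)
        self.buffer.append(Transition(state, action, reward=0.0))
        return action

    def assign_final_reward(self, reward: float):
        # simplest credit assignment: broadcast the episode-level reward
        # to every transition; hierarchical schemes refine this
        for t in self.buffer:
            t.reward = reward
```

Because the wrapper has the same call signature as the original function, the agent code is unmodified; only the entry point is swapped.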
4. Architectural and System Considerations
Agent-native mid-training unifies models, data, and environment interfaces:
- Execution-agent/trainer decoupling (Agent Lightning): Direct observability frameworks allow off-policy RL algorithms to consume agent-generated transitions with negligible code overhead; native agent workflows remain untouched (Luo et al., 5 Aug 2025).
- Multi-stage curricula: Progressive exposure to agentic trajectories is embedded as the final phase in multi-stage pretraining (e.g., commonsense → STEM → agentic) to facilitate internalization of planning and reflection (Lu et al., 31 Dec 2025).
- Long-context support and specialized vocabularies: Architectures such as Multi-Latent Attention (MLA) and STEM-optimized vocabularies are deployed to handle long-horizon agentic sequences and reduce compression overhead (Lu et al., 31 Dec 2025).
- Unified communication and representation (machine language tokens): Agents are trained to generate and interpret specialized token embeddings as an efficient channel for agent-to-agent or multi-modal communication, jointly optimized for task loss and embedding robustness (Xiao et al., 29 Jul 2025).
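The machine-language-token idea, agents exchanging dense embeddings rather than natural-language text, can be sketched with a codebook and nearest-neighbor decoding. In the cited work the embeddings are jointly learned with the task loss; here a random codebook stands in, and the vocabulary and dimensions are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["plan", "act", "done", "error"]       # toy message inventory
CODEBOOK = rng.normal(size=(len(VOCAB), 16))   # stand-in for learned embeddings

def encode(msg: str) -> np.ndarray:
    """Map a discrete message to its dense token embedding."""
    return CODEBOOK[VOCAB.index(msg)]

def decode(vec: np.ndarray) -> str:
    """Nearest-neighbor decode; robust to additive channel noise
    because codebook vectors are far apart relative to the noise."""
    dists = np.linalg.norm(CODEBOOK - vec, axis=1)
    return VOCAB[int(dists.argmin())]
```

A noisy round trip such as `decode(encode("act") + rng.normal(scale=0.1, size=16))` still recovers the message, which is the robustness property the joint embedding objective optimizes for.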
5. Empirical Evidence and Scaling Laws
Experimental studies across benchmarks and domains validate the efficacy of ANMT:
- On SWE-Bench Verified, daVinci-Dev Qwen2.5-72B with full agent-native mid-training achieves 58.5% resolution rate, outperforming the agentless Kimi-Dev MT (48.6%) with less than half the training tokens (Zeng et al., 26 Jan 2026).
- Youtu-LLM’s agentic mid-training yields a +14.4% average relative improvement across six downstream agentic benchmarks, with a ~42.7% relative lift in code Pass@1 (Lu et al., 31 Dec 2025).
- Agent Lightning demonstrates stable performance improvement when deploying RL training in situ across text-to-SQL, retrieval-augmented generation, and math tool-use agents, with no agent code modification (Luo et al., 5 Aug 2025).
- MetaAgent’s meta tool learning elevates a frozen LLM from novice to expert-level tool reasoning without any weight updates; ablations show up to an ~8-point drop in exact match (EM) when the in-house knowledge base is omitted (Qian et al., 1 Aug 2025).
- Agent Learning via Early Experience records +9.6 points in in-domain and +9.4 in out-of-domain generalization over imitation-only baselines, with further gains in RL-ready settings (Zhang et al., 9 Oct 2025).
- Scaling analysis in Youtu-LLM indicates logarithmic agentic performance scaling with mid-training token size, and unsaturated learning curves with increasing scale (Lu et al., 31 Dec 2025, Zeng et al., 26 Jan 2026).
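The logarithmic scaling claim can be made concrete with a least-squares fit in log-token space; the data points below are hypothetical, chosen only to illustrate the functional form, and are not the reported measurements:

```python
import numpy as np

# Hypothetical (mid-training tokens, benchmark score) points laid out to be
# consistent with logarithmic scaling.
tokens = np.array([1e9, 1e10, 1e11, 2e11])
scores = np.array([30.0, 38.0, 46.0, 48.4])

# Fit score ≈ a + b * log10(tokens): a straight line in log-token space.
b, a = np.polyfit(np.log10(tokens), scores, deg=1)

# Under a log-linear law the curve is unsaturated: doubling the token budget
# again still predicts a further gain of roughly b * log10(2) points.
predicted = a + b * np.log10(4e11)
```

The slope `b` (points gained per decade of tokens) is the quantity a scaling analysis tracks; an unsaturated learning curve corresponds to `b` staying positive as the budget grows.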
6. Limitations, Challenges, and Future Research Directions
Challenges and open areas in agent-native mid-training include:
- Data authenticity and coverage: Ensuring the breadth (contextual diversity) and depth (authentic feedback) of agentic trajectories remains a bottleneck for non-code domains and underexplored environments (Zeng et al., 26 Jan 2026).
- Privacy and reproducibility: Persisting developer identifiers, reliance on patched test harnesses, and single-model evaluations restrict generalizability and reproducibility in code domains (Zeng et al., 26 Jan 2026).
- Multi-agent scaling: Pure single-agent scaling does not spontaneously yield robust multi-agent intelligence, as shown by plateaus on ToMBench and CoordinationQA without targeted multi-agent mid-training (Hu et al., 9 Dec 2025).
- Sample efficiency and credit assignment: Advanced techniques for hierarchical credit assignment and value learning promise further improvement in RL-based agent optimization, especially in distributed or tool-rich settings (Luo et al., 5 Aug 2025).
- Beyond RL and LLMs: Adaptive prompt optimization and modalities outside text (audio, vision, tactile) are conceptual extensions of the paradigm (Xiao et al., 29 Jul 2025, Lu et al., 31 Dec 2025).
- Curricular and architectural synergy: The synergy of contextually- and environmentally-native data, progressive curricula, and scalable architectures underlies robust agentic skill acquisition (Lu et al., 31 Dec 2025, Zeng et al., 26 Jan 2026).
A plausible implication is that as mid-training scales in both trajectory diversity and size, the marginal gains in agentic behavior are maintained, providing a pathway to increasingly general, robust, and intrinsically agentic models.
7. Comparative Table: Key Agent-Native Mid-Training Paradigms
| Approach | Core Mechanism | Domain | Weight Update | Empirical Gains |
|---|---|---|---|---|
| daVinci-Dev | Trajectory-centric next-token modeling | Code | Yes | 58.5% resolution rate on SWE-Bench Verified (Zeng et al., 26 Jan 2026) |
| Youtu-LLM | Scalable agentic data curriculum | General/STEM | Yes | +14.4% avg. lift, log-linear scaling with data size (Lu et al., 31 Dec 2025) |
| MetaAgent | Meta tool learning, context enrichment | Web research | No | EM up to 52.1%; ablation drops up to 8 pts on EM (Qian et al., 1 Aug 2025) |
| Agent Lightning | Agent execution → RL on MDP | Any | Yes | Continuous improvement in SQL, RAG, math tool-use (Luo et al., 5 Aug 2025) |
| Early Experience | Self-reflective rollout modeling | General | Yes | +9.6 points in-domain, +9.4 OOD, improved post-RL ceiling (Zhang et al., 9 Oct 2025) |
| Function Editing | LLM-driven library optimization | Symbolic/math | No | +3–11% on MATH, TabMWP, GAIA (Zhang et al., 2024) |
| Machine Language Tokens | Embedding-based comm. | Multi-modal | Yes | Compression ratio ≈0.01, <5% accuracy loss at low SNR (Xiao et al., 29 Jul 2025) |
| Native Multi-Agent | Multi-agent loss interleaving | Multi-agent | Yes | Proposed blueprint; targeted multi-agent mid-training is required to surpass coordination accuracy plateaus (Hu et al., 9 Dec 2025) |