Agent-Native Mid-Training Paradigm
- Agent-Native Mid-Training Paradigm is a training approach that internalizes agentic behaviors within model parameters using authentic action-observation feedback sequences.
- It leverages methods such as trajectory-centric modeling, reinforcement learning, self-reflection, and multi-agent coordination to optimize agent workflows.
- Empirical results are strong: for example, a 58.5% resolution rate on SWE-Bench Verified and a +14.4% average performance lift across agentic benchmarks, underscoring the paradigm's scalability and practical effectiveness.
Agent-Native Mid-Training Paradigm refers to a spectrum of training algorithms and data regimes for instilling agentic behaviors—such as planning, tool-use, reflection, or multi-agent coordination—directly within model or agent parameters, at an intermediate stage between pretraining and downstream post-training. Unlike pipeline-based systems, where agentic capabilities are modularized and orchestrated by external scripts or policies, agent-native mid-training emphasizes the internalization and direct optimization of agent workflows, feedback loops, and interactive environment signals, either within fixed model weights or via structured adaptive context. This paradigm encompasses a wide array of methodologies: trajectory-centric next-token modeling on agentic data, in-place reinforcement learning from agent execution, meta tool learning and self-reflection without weight updates, multi-agent mid-training with interaction-centric objectives, and function-library adaptation with model weights frozen. Across domains—software engineering, STEM, multimodal reasoning, tool-based research, and multi-agent environments—agent-native mid-training marks a methodological pivot towards agents that learn, adapt, and specialize through authentic, dynamic experience rather than static demonstrations or scripted scaffolds.
1. Defining Agent-Native Mid-Training: Foundations and Motivation
Agent-native mid-training (ANMT) is the intermediate training regime that bridges large-scale unsupervised pretraining and data-intensive supervised or reinforcement learning (RL) post-training. Its key distinction is the use of agent-native data: supervision that encodes the full action-observation-feedback sequences native to autonomous agents operating in authentic environments. In formal terms, given a policy $\pi_\theta$, observations $o_t$, actions $a_t$, and an agent-native trajectory corpus $\mathcal{D}$, the objective is:

$$\max_\theta \; \mathbb{E}_{\tau \sim \mathcal{D}} \Big[ \textstyle\sum_{t} \log \pi_\theta(a_t \mid o_{\le t}, a_{<t}) \Big],$$

while ensuring that each sample encodes multi-step sequences as encountered by a deployed agent (Zeng et al., 26 Jan 2026).
ANMT is motivated by the observed inefficiencies and domain gaps of both static pretraining—which lacks dynamic, interactive state-action feedback—and pure RL, which is computationally prohibitive and constrained by the base model’s representational limits. By leveraging supervised learning on authentic agentic rollouts at scale, ANMT offers a scalable, data-efficient, and task-general paradigm for injecting agentic priors and behaviors (Zeng et al., 26 Jan 2026, Lu et al., 31 Dec 2025).
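The objective above, with segment masking restricted to agent-output tokens, can be sketched as a masked cross-entropy loss. A minimal NumPy sketch (function and variable names are our own, not taken from the cited papers):

```python
import numpy as np

def masked_next_token_loss(logits, targets, agent_mask):
    """Average cross-entropy over next-token predictions, counted only on
    agent-output tokens (agent_mask == 1); observation/environment tokens
    are masked out, as in segment-masked trajectory modeling.
    Shapes: logits (T, V), targets (T,), agent_mask (T,)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]   # per-token NLL
    # average only over agent tokens, ignoring masked positions
    return (token_nll * agent_mask).sum() / agent_mask.sum()
```

With uniform logits the loss equals $\log V$ over the unmasked positions; as the model concentrates probability on the agent's actual actions, the loss approaches zero.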
2. Categories of Agent-Native Data and Trajectories
Across ANMT implementations, the foundation lies in the careful construction or synthesis of agent-native data, which falls broadly into the following categories:
| Type | Description | Example Domains |
|---|---|---|
| Contextually-native | Full action-observation history, navigation context, edit traces | Code PR workflows (Zeng et al., 26 Jan 2026) |
| Environmentally-native | Real environment interactive feedback (tool output, errors) | Executable code edits (Zeng et al., 26 Jan 2026) |
| Agentic–Chain-of-Thought | Structured multi-phase (plan, act, reflect) generator trajectories | STEM, math, research (Lu et al., 31 Dec 2025) |
| Multi-agent interactive | Self-play or multi-agent dialogs; role-specific decision logs | Coordination, theory-of-mind (Hu et al., 9 Dec 2025) |
| Task-oriented machine tokens | LLM-learned embedding-based message trajectories | Agent communication (Xiao et al., 29 Jul 2025) |
| Self-Reflection & Meta-Tool | Episodic reflection, tool-use logs, context augmentation | Knowledge agents (Qian et al., 1 Aug 2025) |
| Offline function editing | Incremental library synthesis/edit traces with failure feedback | Symbolic reasoning, QA (Zhang et al., 2024) |
For example, daVinci-Dev synthesizes both contextually-native (from pull requests and repo navigation) and environmentally-native (from in-Docker agent execution and tool feedback) code trajectories, enabling large-scale exposure to authentic agentic feedback loops (Zeng et al., 26 Jan 2026). In Youtu-LLM, agentic mid-training leverages over 200B tokens of structured math, code, research, and tool-use trajectories, each annotated with explicit phase tags (<Analysis>, <Plan>, <Action>, etc.), branching at critical action or failure points (Lu et al., 31 Dec 2025).
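The phase-tagged trajectory structure described above can be sketched as a simple data model. The tag set mirrors the `<Analysis>`/`<Plan>`/`<Action>` annotations reported for Youtu-LLM, but the class layout and serialization are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    phase: str             # e.g. "Analysis", "Plan", "Action", "Reflect"
    content: str           # model-generated text for this phase
    observation: str = ""  # environment feedback (tool output, errors), if any

@dataclass
class AgentTrajectory:
    task: str
    steps: list = field(default_factory=list)

    def to_training_text(self) -> str:
        """Serialize the trajectory with explicit phase tags so that
        mid-training sees interleaved agent output and environment feedback."""
        parts = [self.task]
        for s in self.steps:
            parts.append(f"<{s.phase}>{s.content}</{s.phase}>")
            if s.observation:
                parts.append(f"<Observation>{s.observation}</Observation>")
        return "\n".join(parts)
```

In a real pipeline the `<Observation>` spans would be masked out of the loss, while the phase-tagged agent spans are supervised.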
3. Training Algorithms and Optimization Objectives
The ANMT paradigm encompasses varying learning modalities and objectives, including:
- Trajectory-centric next-token modeling: Standard cross-entropy loss on assistant- or agent-output tokens in agentic trajectories, often with segment masking to focus learning on agent behaviors (Lu et al., 31 Dec 2025, Zeng et al., 26 Jan 2026).
- Reinforcement learning from native execution (Agent Lightning): Treats agent executions as Markov Decision Processes (MDPs), logging tuples at runtime and applying hierarchical RL (e.g., GRPO) on transitions without altering agent orchestration logic (Luo et al., 5 Aug 2025).
- Meta tool learning and context synthesis (MetaAgent): Incorporates self-reflection and verified reflection into dynamic context banks and in-house knowledge bases to shape agent behavior without parameter updates (Qian et al., 1 Aug 2025).
- Early experience and self-reflection (Agent Learning via Early Experience): Generates agent rollouts from current policies, applying auxiliary objectives for implicit world modeling and natural language rationale generation to improve policy grounding and reasoning (Zhang et al., 9 Oct 2025).
- Function library optimization with frozen models: Treats the function set as the “agent parameters”, using an LLM-based optimizer to incrementally edit the library with roll-back and early-stop mechanisms, while the core model weights remain unchanged (Zhang et al., 2024).
- Multi-agent losses (native multi-agent mid-training): Combines understanding (theory-of-mind), joint planning, communication efficiency, and adaptation, each with specific loss functions, interleaved within minibatches according to a multi-task mixture (Hu et al., 9 Dec 2025).
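The Agent Lightning framing, logging (state, action, reward) tuples at runtime without touching agent orchestration logic, can be illustrated by a thin wrapper around the agent's LLM-call function. The names and the broadcast credit assignment below are our own simplifications, not the framework's actual API:

```python
from dataclasses import dataclass

@dataclass
class Transition:
    state: str     # serialized agent context at the time of the LLM call
    action: str    # the model's generated output
    reward: float  # assigned after execution finishes

class TransitionLogger:
    """Wraps an agent's LLM-call function so each call is recorded as an
    MDP transition for a downstream RL trainer (e.g. GRPO), leaving the
    agent's own orchestration code unchanged."""
    def __init__(self, llm_call):
        self.llm_call = llm_call
        self.buffer = []

    def __call__(self, state: str) -> str:
        action = self.llm_call(state)
        self.buffer.append(Transition(state, action, reward=0.0))
        return action

    def assign_final_reward(self, reward: float):
        # simplest credit assignment: broadcast the episode-level reward
        # to every transition; hierarchical schemes refine this
        for t in self.buffer:
            t.reward = reward
```

Because the wrapper has the same call signature as the original function, the agent code is unmodified; only the entry point is swapped.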
4. Architectural and System Considerations
Agent-native mid-training unifies models, data, and environment interfaces:
- Execution-agent/trainer decoupling (Agent Lightning): Direct observability frameworks allow off-policy RL algorithms to consume agent-generated transitions with negligible code overhead; native agent workflows remain untouched (Luo et al., 5 Aug 2025).
- Multi-stage curricula: Progressive exposure to agentic trajectories is embedded as the final phase in multi-stage pretraining (e.g., commonsense → STEM → agentic) to facilitate internalization of planning and reflection (Lu et al., 31 Dec 2025).
- Long-context support and specialized vocabularies: Architectures such as Multi-Latent Attention (MLA) and STEM-optimized vocabularies are deployed to handle long-horizon agentic sequences and reduce compression overhead (Lu et al., 31 Dec 2025).
- Unified communication and representation (machine language tokens): Agents are trained to generate and interpret specialized token embeddings as an efficient channel for agent-to-agent or multi-modal communication, jointly optimized for task loss and embedding robustness (Xiao et al., 29 Jul 2025).
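The machine-language-token idea, agents exchanging dense embeddings rather than natural-language text, can be sketched with a codebook and nearest-neighbor decoding. In the cited work the embeddings are jointly learned with the task loss; here a random codebook stands in, and the vocabulary and dimensions are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["plan", "act", "done", "error"]       # toy message inventory
CODEBOOK = rng.normal(size=(len(VOCAB), 16))   # stand-in for learned embeddings

def encode(msg: str) -> np.ndarray:
    """Map a discrete message to its dense token embedding."""
    return CODEBOOK[VOCAB.index(msg)]

def decode(vec: np.ndarray) -> str:
    """Nearest-neighbor decode; robust to additive channel noise
    because codebook vectors are far apart relative to the noise."""
    dists = np.linalg.norm(CODEBOOK - vec, axis=1)
    return VOCAB[int(dists.argmin())]
```

A noisy round trip such as `decode(encode("act") + rng.normal(scale=0.1, size=16))` still recovers the message, which is the robustness property the joint embedding objective optimizes for.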
5. Empirical Evidence and Scaling Laws
Experimental studies across benchmarks and domains validate the efficacy of ANMT:
- On SWE-Bench Verified, daVinci-Dev Qwen2.5-72B with full agent-native mid-training achieves 58.5% resolution rate, outperforming the agentless Kimi-Dev MT (48.6%) with less than half the training tokens (Zeng et al., 26 Jan 2026).
- Youtu-LLM’s agentic mid-training yields a +14.4% average relative improvement across six downstream agentic benchmarks, with a ~42.7% relative lift in code Pass@1 (Lu et al., 31 Dec 2025).
- Agent Lightning demonstrates stable performance improvement when deploying RL training in situ across text-to-SQL, retrieval-augmented generation, and math tool-use agents, with no agent code modification (Luo et al., 5 Aug 2025).
- MetaAgent’s meta tool learning elevates a frozen LLM from novice to expert-level tool reasoning without any weight updates; ablations show up to an ~8-point drop in exact match (EM) when the in-house knowledge base is omitted (Qian et al., 1 Aug 2025).
- Agent Learning via Early Experience records +9.6 points in in-domain and +9.4 in out-of-domain generalization over imitation-only baselines, with further gains in RL-ready settings (Zhang et al., 9 Oct 2025).
- Scaling analysis in Youtu-LLM indicates logarithmic agentic performance scaling with mid-training token size, and unsaturated learning curves with increasing scale (Lu et al., 31 Dec 2025, Zeng et al., 26 Jan 2026).
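The logarithmic scaling claim can be made concrete with a least-squares fit in log-token space; the data points below are hypothetical, chosen only to illustrate the functional form, and are not the reported measurements:

```python
import numpy as np

# Hypothetical (mid-training tokens, benchmark score) points laid out to be
# consistent with logarithmic scaling.
tokens = np.array([1e9, 1e10, 1e11, 2e11])
scores = np.array([30.0, 38.0, 46.0, 48.4])

# Fit score ≈ a + b * log10(tokens): a straight line in log-token space.
b, a = np.polyfit(np.log10(tokens), scores, deg=1)

# Under a log-linear law the curve is unsaturated: doubling the token budget
# again still predicts a further gain of roughly b * log10(2) points.
predicted = a + b * np.log10(4e11)
```

The slope `b` (points gained per decade of tokens) is the quantity a scaling analysis tracks; an unsaturated learning curve corresponds to `b` staying positive as the budget grows.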
6. Limitations, Challenges, and Future Research Directions
Challenges and open areas in agent-native mid-training include:
- Data authenticity and coverage: Ensuring the breadth (contextual diversity) and depth (authentic feedback) of agentic trajectories remains a bottleneck for non-code domains and underexplored environments (Zeng et al., 26 Jan 2026).
- Privacy and reproducibility: Persisting developer identifiers, reliance on patched test harnesses, and single-model evaluations restrict generalizability and reproducibility in code domains (Zeng et al., 26 Jan 2026).
- Multi-agent scaling: Pure single-agent scaling does not spontaneously yield robust multi-agent intelligence, as shown by plateaus on ToMBench and CoordinationQA without targeted multi-agent mid-training (Hu et al., 9 Dec 2025).
- Sample efficiency and credit assignment: Advanced techniques for hierarchical credit assignment and value learning promise further improvement in RL-based agent optimization, especially in distributed or tool-rich settings (Luo et al., 5 Aug 2025).
- Beyond RL and LLMs: Adaptive prompt optimization and modalities outside text (audio, vision, tactile) are conceptual extensions of the paradigm (Xiao et al., 29 Jul 2025, Lu et al., 31 Dec 2025).
- Curricular and architectural synergy: The synergy of contextually- and environmentally-native data, progressive curricula, and scalable architectures underlies robust agentic skill acquisition (Lu et al., 31 Dec 2025, Zeng et al., 26 Jan 2026).
A plausible implication is that as mid-training scales in both trajectory diversity and size, the marginal gains in agentic behavior are maintained, providing a pathway to increasingly general, robust, and intrinsically agentic models.
7. Comparative Table: Key Agent-Native Mid-Training Paradigms
| Approach | Core Mechanism | Domain | Weight Update | Empirical Gains |
|---|---|---|---|---|
| daVinci-Dev | Trajectory-centric next-token modeling | Code | Yes | 58.5% resolution rate on SWE-Bench Verified (Zeng et al., 26 Jan 2026) |
| Youtu-LLM | Scalable agentic data curriculum | General/STEM | Yes | +14.4% avg. lift, log-linear scaling with data size (Lu et al., 31 Dec 2025) |
| MetaAgent | Meta tool learning, context enrichment | Web research | No | EM up to 52.1%; ablation drops up to 8 pts on EM (Qian et al., 1 Aug 2025) |
| Agent Lightning | Agent execution → RL on MDP | Any | Yes | Continuous improvement in SQL, RAG, math tool-use (Luo et al., 5 Aug 2025) |
| Early Experience | Self-reflective rollout modeling | General | Yes | +9.6 points in-domain, +9.4 OOD, improved post-RL ceiling (Zhang et al., 9 Oct 2025) |
| Function Editing | LLM-driven library optimization | Symbolic/math | No | +3–11% on MATH, TabMWP, GAIA (Zhang et al., 2024) |
| Machine Language Tokens | Embedding-based comm. | Multi-modal | Yes | Compression ratio ≈0.01, <5% accuracy loss at low SNR (Xiao et al., 29 Jul 2025) |
| Native Multi-Agent | Multi-agent loss interleaving | Multi-agent | Yes | Proposed blueprint; targeted multi-agent mid-training is required to surpass coordination accuracy plateaus (Hu et al., 9 Dec 2025) |