
Agent-Native Mid-Training Paradigm

Updated 18 February 2026
  • Agent-Native Mid-Training Paradigm is a training approach that internalizes agentic behaviors within model parameters using authentic action-observation feedback sequences.
  • It leverages methods such as trajectory-centric modeling, reinforcement learning, self-reflection, and multi-agent coordination to optimize agent workflows.
  • Empirical results show strong gains, such as a 58.5% resolution rate on SWE-Bench Verified and a +14.4% average lift across agentic benchmarks, underscoring the paradigm's scalability and practical effectiveness.

Agent-Native Mid-Training Paradigm refers to a spectrum of training algorithms and data regimes for instilling agentic behaviors—such as planning, tool-use, reflection, or multi-agent coordination—directly within model or agent parameters, at an intermediate stage between pretraining and downstream post-training. Unlike pipeline-based systems, where agentic capabilities are modularized and orchestrated by external scripts or policies, agent-native mid-training emphasizes the internalization and direct optimization of agent workflows, feedback loops, and interactive environment signals, either within fixed model weights or via structured adaptive context. This paradigm encompasses a wide array of methodologies: trajectory-centric next-token modeling on agentic data, in-place reinforcement learning from agent execution, meta tool learning and self-reflection without weight updates, multi-agent mid-training with interaction-centric objectives, and function-library adaptation with model weights frozen. Across domains—software engineering, STEM, multimodal reasoning, tool-based research, and multi-agent environments—agent-native mid-training marks a methodological pivot towards agents that learn, adapt, and specialize through authentic, dynamic experience rather than static demonstrations or scripted scaffolds.

1. Defining Agent-Native Mid-Training: Foundations and Motivation

Agent-native mid-training (ANMT) is the intermediate training regime that bridges large-scale unsupervised pretraining and data-intensive supervised or reinforcement learning (RL) post-training. Its key distinction is the use of agent-native data: supervision that encodes the full action-observation-feedback sequences native to autonomous agents operating in authentic environments. In formal terms, given a policy $\pi_\theta$, observations $\mathrm{Obs}$, and an agent-native trajectory corpus $\mathcal{D}_{\mathrm{MT}}$, the objective is:

$$\mathcal{L}_{\mathrm{MT}}(\theta) = -\sum_{(x,y) \in \mathcal{D}_{\mathrm{MT}}} \sum_{t=1}^{|y|} \log p_\theta(y_t \mid x_{<t})$$

while ensuring that each sample $(x, y)$ encodes multi-step sequences $\{(a_t, o_t)\}_{t=1}^{T}$ as encountered by a deployed agent (Zeng et al., 26 Jan 2026).
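This masked objective can be sketched in plain Python. The `mid_training_loss` helper, the trajectory encoding, and the toy uniform model below are illustrative assumptions, not from the cited papers: observation tokens stay in the context while only agent-emitted tokens are scored.

```python
import math

def mid_training_loss(trajectories, log_prob):
    """Masked next-token loss over agent-native trajectories.

    trajectories: list of (tokens, agent_mask) pairs, where agent_mask[t]
    is True for tokens emitted by the agent (actions a_t) and False for
    environment observations o_t, which remain context but are not scored.
    log_prob(prefix, token): model log-probability of `token` given `prefix`.
    """
    total = 0.0
    for tokens, agent_mask in trajectories:
        for t, tok in enumerate(tokens):
            if agent_mask[t]:                       # score only agent tokens
                total -= log_prob(tokens[:t], tok)  # -log p_theta(y_t | x_<t)
    return total

# Toy uniform "model" over a 4-token vocabulary, for illustration only.
uniform = lambda prefix, tok: math.log(0.25)

traj = ([1, 2, 3, 0], [False, True, True, False])
loss = mid_training_loss([traj], uniform)
```

In practice the same effect is usually achieved by setting the labels of non-agent tokens to an ignore index in a standard cross-entropy implementation.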

ANMT is motivated by the observed inefficiencies and domain gaps of both static pretraining—which lacks dynamic, interactive state-action feedback—and pure RL, which is computationally prohibitive and constrained by the base model’s representational limits. By leveraging supervised learning on authentic agentic rollouts at scale, ANMT offers a scalable, data-efficient, and task-general paradigm for injecting agentic priors and behaviors (Zeng et al., 26 Jan 2026, Lu et al., 31 Dec 2025).

2. Categories of Agent-Native Data and Trajectories

Across ANMT implementations, the foundation lies in the careful construction or synthesis of agent-native data, which falls broadly into the following categories:

| Type | Description | Example Domains |
|------|-------------|-----------------|
| Contextually-native | Full action-observation history, navigation context, edit traces | Code PR workflows (Zeng et al., 26 Jan 2026) |
| Environmentally-native | Real environment interactive feedback (tool output, errors) | Executable code edits (Zeng et al., 26 Jan 2026) |
| Agentic Chain-of-Thought | Structured multi-phase (plan, act, reflect) generator trajectories | STEM, math, research (Lu et al., 31 Dec 2025) |
| Multi-agent interactive | Self-play or multi-agent dialogs; role-specific decision logs | Coordination, theory-of-mind (Hu et al., 9 Dec 2025) |
| Task-oriented machine tokens | LLM-learned embedding-based message trajectories | Agent communication (Xiao et al., 29 Jul 2025) |
| Self-reflection & meta-tool | Episodic reflection, tool-use logs, context augmentation | Knowledge agents (Qian et al., 1 Aug 2025) |
| Offline function editing | Incremental library synthesis/edit traces with failure feedback | Symbolic reasoning, QA (Zhang et al., 2024) |

For example, daVinci-Dev synthesizes both contextually-native (from pull requests and repo navigation) and environmentally-native (from in-Docker agent execution and tool feedback) code trajectories, enabling large-scale exposure to authentic agentic feedback loops (Zeng et al., 26 Jan 2026). In Youtu-LLM, agentic mid-training leverages over 200B tokens of structured math, code, research, and tool-use trajectories, each annotated with explicit phase tags (<Analysis>, <Plan>, <Action>, etc.), branching at critical action or failure points (Lu et al., 31 Dec 2025).
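As an illustration, a phase-tagged trajectory of the kind described for Youtu-LLM might be serialized as below. The `serialize` helper, the sample contents, and any phase names beyond the paper's `<Analysis>`/`<Plan>`/`<Action>` examples are hypothetical.

```python
# Hypothetical serialization of one phase-tagged agentic training sample.
# "Observation" and "Reflection" are assumed extra phases (the source only
# names <Analysis>, <Plan>, <Action>, "etc.").
PHASES = ["Analysis", "Plan", "Action", "Observation", "Reflection"]

def serialize(steps):
    """steps: list of (phase, text); returns one tagged training sample."""
    for phase, _ in steps:
        assert phase in PHASES, f"unknown phase: {phase}"
    return "".join(f"<{p}>{text}</{p}>" for p, text in steps)

sample = serialize([
    ("Analysis", "The test fails on empty input."),
    ("Plan", "Guard against len(xs) == 0 before indexing."),
    ("Action", "edit_file('utils.py', add_guard_clause)"),
])
```

Explicit tags of this kind let the trainer mask or branch on specific phases, e.g. restricting the loss to `<Plan>` and `<Action>` spans.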

3. Training Algorithms and Optimization Objectives

The ANMT paradigm encompasses varying learning modalities and objectives, including:

  • Trajectory-centric next-token modeling: Standard cross-entropy loss on assistant- or agent-output tokens in agentic trajectories, often with segment masking to focus learning on agent behaviors (Lu et al., 31 Dec 2025, Zeng et al., 26 Jan 2026).
  • Reinforcement learning from native execution (Agent Lightning): Treats agent executions as Markov Decision Processes (MDPs), logging tuples $(s_t, a_t, r_t)$ at runtime and applying hierarchical RL (e.g., GRPO) on transitions without altering agent orchestration logic (Luo et al., 5 Aug 2025).

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{x \sim \mathcal{X}}\,\mathbb{E}_{(s_t, a_t) \sim \tau} \left[\sum_{j=1}^{N_t} \log \pi_\theta(y_{t,j} \mid s_t, y_{t,<j})\,(G_t - b_x)\right]$$

  • Meta tool learning and context synthesis (MetaAgent): Incorporates self-reflection and verified reflection into dynamic context banks and in-house knowledge bases to shape agent behavior without parameter updates (Qian et al., 1 Aug 2025).
  • Early experience and self-reflection (Agent Learning via Early Experience): Generates agent rollouts from current policies, applying auxiliary objectives for implicit world modeling and natural language rationale generation to improve policy grounding and reasoning (Zhang et al., 9 Oct 2025).
  • Function library optimization with frozen models: Treats function sets $F$ as “agent parameters”, using an LLM-based optimizer to incrementally edit $F$ with roll-back and early-stop mechanisms, while the core model weights remain unchanged (Zhang et al., 2024).
  • Multi-agent losses (native multi-agent mid-training): Combines understanding (theory-of-mind), joint planning, communication efficiency, and adaptation, each with specific loss functions, interleaved within minibatches according to a multi-task mixture (Hu et al., 9 Dec 2025).
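The advantage-weighted RL objective for native execution can be sketched as a minimal Python loop. The function names, the transition encoding, and the per-task dict of GRPO-style baselines are illustrative assumptions, not the papers' implementations.

```python
import math

def rl_mid_training_loss(transitions, log_pi, baseline):
    """Advantage-weighted log-likelihood loss over logged agent transitions.

    transitions: list of (state, action_tokens, return_G, task_id) tuples,
    as recorded at runtime without touching the agent's orchestration code.
    log_pi(state, prefix, token): policy log-prob of the next token.
    baseline: dict mapping task_id -> b_x (e.g. mean group return in GRPO).
    """
    total = 0.0
    for state, tokens, G, task in transitions:
        advantage = G - baseline[task]          # (G_t - b_x)
        for j, tok in enumerate(tokens):
            total -= log_pi(state, tokens[:j], tok) * advantage
    return total

# Toy check with a uniform two-token policy, for illustration only.
uniform = lambda state, prefix, tok: math.log(0.5)
loss = rl_mid_training_loss([("s0", [1, 2], 1.0, "q")], uniform, {"q": 0.5})
```

Transitions with above-baseline returns increase the likelihood of their action tokens; below-baseline transitions decrease it.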

4. Architectural and System Considerations

Agent-native mid-training unifies models, data, and environment interfaces:

  • Execution-agent/trainer decoupling (Agent Lightning): Direct observability frameworks allow off-policy RL algorithms to consume agent-generated transitions with negligible code overhead; native agent workflows remain untouched (Luo et al., 5 Aug 2025).
  • Multi-stage curricula: Progressive exposure to agentic trajectories is embedded as the final phase in multi-stage pretraining (e.g., commonsense → STEM → agentic) to facilitate internalization of planning and reflection (Lu et al., 31 Dec 2025).
  • Long-context support and specialized vocabularies: Architectures such as Multi-Latent Attention (MLA) and STEM-optimized vocabularies are deployed to handle long-horizon agentic sequences and reduce compression overhead (Lu et al., 31 Dec 2025).
  • Unified communication and representation (machine language tokens): Agents are trained to generate and interpret specialized token embeddings as an efficient channel for agent-to-agent or multi-modal communication, jointly optimized for task loss and embedding robustness (Xiao et al., 29 Jul 2025).
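To make the execution-agent/trainer decoupling concrete, here is a sketch of runtime transition logging via a decorator. `log_transition`, the `TRANSITIONS` buffer, and the reward callback are hypothetical names for illustration, not the Agent Lightning API.

```python
import functools

TRANSITIONS = []  # buffer consumed by an external RL trainer

def log_transition(reward_fn):
    """Wrap an agent's step function so each call records (s_t, a_t, r_t)
    without altering the agent's own control flow (hypothetical interface)."""
    def decorate(step):
        @functools.wraps(step)
        def wrapped(state):
            action = step(state)
            TRANSITIONS.append((state, action, reward_fn(state, action)))
            return action
        return wrapped
    return decorate

@log_transition(lambda s, a: 1.0 if a == "DONE" else 0.0)
def agent_step(state):
    # Toy agent logic; the real agent workflow would be arbitrary code.
    return "DONE" if state >= 3 else "CONTINUE"

agent_step(1)
agent_step(3)
```

The trainer then consumes the buffer off-policy, which is why the agent's orchestration logic never needs to change.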

5. Empirical Evidence and Scaling Laws

Experimental studies across benchmarks and domains validate the efficacy of ANMT:

  • On SWE-Bench Verified, daVinci-Dev Qwen2.5-72B with full agent-native mid-training achieves 58.5% resolution rate, outperforming the agentless Kimi-Dev MT (48.6%) with less than half the training tokens (Zeng et al., 26 Jan 2026).
  • Youtu-LLM’s agentic mid-training yields a +14.4% average relative improvement across six downstream agentic benchmarks, with a ~42.7% relative lift in code Pass@1 (Lu et al., 31 Dec 2025).
  • Agent Lightning demonstrates stable performance improvement when deploying RL training in situ across text-to-SQL, retrieval-augmented generation, and math tool-use agents, with no agent code modification (Luo et al., 5 Aug 2025).
  • MetaAgent’s meta tool learning elevates a frozen LLM from novice to expert-level tool reasoning without any weight updates, with ablations showing up to ~8-point drop in EM if the in-house knowledge base is omitted (Qian et al., 1 Aug 2025).
  • Agent Learning via Early Experience records +9.6 points in in-domain and +9.4 in out-of-domain generalization over imitation-only baselines, with further gains in RL-ready settings (Zhang et al., 9 Oct 2025).
  • Scaling analysis in Youtu-LLM indicates logarithmic agentic performance scaling with mid-training token size, and unsaturated learning curves with increasing scale (Lu et al., 31 Dec 2025, Zeng et al., 26 Jan 2026).

6. Limitations, Challenges, and Future Research Directions

Challenges and open areas in agent-native mid-training include:

  • Data authenticity and coverage: Ensuring the breadth (contextual diversity) and depth (authentic feedback) of agentic trajectories remains a bottleneck for non-code domains and underexplored environments (Zeng et al., 26 Jan 2026).
  • Privacy and reproducibility: Persisting developer identifiers, reliance on patched test harnesses, and single-model evaluations restrict generalizability and reproducibility in code domains (Zeng et al., 26 Jan 2026).
  • Multi-agent scaling: Pure single-agent scaling does not spontaneously yield robust multi-agent intelligence, as shown by plateaus on ToMBench and CoordinationQA without targeted multi-agent mid-training (Hu et al., 9 Dec 2025).
  • Sample efficiency and credit assignment: Advanced techniques for hierarchical credit assignment and value learning promise further improvement in RL-based agent optimization, especially in distributed or tool-rich settings (Luo et al., 5 Aug 2025).
  • Beyond RL and LLMs: Adaptive prompt optimization and modalities outside text (audio, vision, tactile) are conceptual extensions of the paradigm (Xiao et al., 29 Jul 2025, Lu et al., 31 Dec 2025).
  • Curricular and architectural synergy: The synergy of contextually- and environmentally-native data, progressive curricula, and scalable architectures underlies robust agentic skill acquisition (Lu et al., 31 Dec 2025, Zeng et al., 26 Jan 2026).

A plausible implication is that as mid-training scales in both trajectory diversity and size, the marginal gains in agentic behavior are maintained, providing a pathway to increasingly general, robust, and intrinsically agentic models.

7. Comparative Table: Key Agent-Native Mid-Training Paradigms

| Approach | Core Mechanism | Domain | Weight Update | Empirical Gains |
|----------|----------------|--------|---------------|-----------------|
| daVinci-Dev | Trajectory-centric next-token modeling | Code | Yes | Pass@1 up to 58.5% on SWE-Bench Verified (Zeng et al., 26 Jan 2026) |
| Youtu-LLM | Scalable agentic data curriculum | General/STEM | Yes | +14.4% avg. lift, log-linear scaling with data size (Lu et al., 31 Dec 2025) |
| MetaAgent | Meta tool learning, context enrichment | Web research | No | EM up to 52.1%; ablation drops up to 8 pts on EM (Qian et al., 1 Aug 2025) |
| Agent Lightning | Agent execution → RL on MDP | Any | Yes | Continuous improvement in SQL, RAG, math tool-use (Luo et al., 5 Aug 2025) |
| Early Experience | Self-reflective rollout modeling | General | Yes | +9.6 points in-domain, +9.4 OOD, improved post-RL ceiling (Zhang et al., 9 Oct 2025) |
| Function Editing | LLM-driven library optimization | Symbolic/math | No | +3–11% on MATH, TabMWP, GAIA (Zhang et al., 2024) |
| Machine Language Tokens | Embedding-based communication | Multi-modal | Yes | Compression ratio ≈0.01, <5% accuracy loss at low SNR (Xiao et al., 29 Jul 2025) |
| Native Multi-Agent | Multi-agent loss interleaving | Multi-agent | Yes | Blueprint only; required to surpass coordination accuracy plateaus (Hu et al., 9 Dec 2025) |
