
Agentic Mid-training: Bridging Imitation and RL

Updated 3 January 2026
  • Agentic mid-training is a paradigm that bridges language modeling, imitation learning, and reinforcement learning by abstracting temporally-extended actions.
  • It employs methodologies like the RA3 algorithm to alternate between RL bootstrapping and supervised fine-tuning, reducing sample complexity.
  • Empirical results demonstrate substantial gains in code generation and reasoning tasks, validating the efficiency of action abstraction in agentic systems.

Agentic mid-training is a paradigm and methodology that bridges language modeling, imitation learning, and fully agentic reinforcement learning (RL) for large language models (LLMs) and agentic systems. By abstracting high-level actions and learning temporally extended behaviors during a mid-stage training phase, it lets models acquire compact, generalizable action spaces, which supports robust online RL and accelerates the acquisition of complex reasoning, planning, and tool-use capabilities.

1. Formal Foundations and Motivation

Agentic mid-training formalizes the transition from sequence-level imitation to action-abstraction-driven RL. Tasks are modeled as Markov Decision Processes (MDPs) $M = (S, A, R, \gamma)$, where $S$ denotes the state space (e.g., code prefixes), $A$ the atomic action space (e.g., token selections), $R$ the reward, and $\gamma$ the discount factor. Standard next-token prediction (NTP) corresponds to imitation learning:

$$J_{\mathrm{NTP}}(\pi) = \mathbb{E}_{(s_{0:T}, a_{0:T}) \sim D_E}\left[ \sum_{t=0}^{T} \log \pi(a_t \mid s_t) \right]$$
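The NTP objective can be illustrated with a minimal sketch. It assumes a hypothetical tabular policy stored as nested dictionaries (`policy[state][action]` is a probability) and treats $D_E$ as a small list of expert trajectories; none of these names come from the cited work.

```python
import math

def ntp_log_likelihood(policy, trajectory):
    """Sum of log pi(a_t | s_t) over one expert trajectory of (state, action) pairs."""
    return sum(math.log(policy[state][action]) for state, action in trajectory)

def ntp_objective(policy, dataset):
    """Monte Carlo estimate of J_NTP: mean trajectory log-likelihood over the dataset."""
    return sum(ntp_log_likelihood(policy, traj) for traj in dataset) / len(dataset)

# Toy policy (illustrative only): the state is a code prefix, the action is the next token.
policy = {
    "def": {" f": 0.5, " g": 0.5},
    "def f": {"(": 1.0},
}
dataset = [[("def", " f"), ("def f", "(")]]
score = ntp_objective(policy, dataset)
print(score)
```

Maximizing this quantity over the policy parameters is plain imitation of the expert tokens; no reward signal is involved.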

Agentic mid-training departs from pure NTP by introducing a compact, temporally-extended action space $Z$, where each $z \in Z$ may implement subpolicies or macro-actions spanning multiple atomic steps (duration $\tau$). The objective is to select a minimal yet sufficient subset $\widehat{Z} \subset Z$ such that subsequent online RL can operate effectively within this reduced space, facilitating policy optimization over options, rationales, or latent abstractions rather than primitive tokens (Zhang et al., 30 Sep 2025).

2. Theoretical Characterization of Performance

The efficacy of agentic mid-training is characterized by its impact on regret decomposition following post-training RL:

SS0

SS1

where SS2 quantifies the value-approximation (pruning) error induced by restricting actions to SS3, and the second term is the RL error within the pruned space.

Key formal results:

  • The minimal SS4-optimal SS5 controls sample complexity: with SS6 expert demonstrations, all SS7-suboptimal actions are pruned with probability SS8, yielding SS9.
  • RL convergence in the abstracted MDP AA0 is governed by effective discount AA1, with contraction rate AA2 and sample complexity AA3 (Zhang et al., 30 Sep 2025).

These results demonstrate that pruning efficiency (AA4) and horizon-shortening via longer-temporal abstractions (AA5) synergistically reduce expert data and RL updates.
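The horizon-shortening effect can be illustrated numerically. The sketch below assumes the standard semi-MDP convention in which an action lasting a fixed $\tau$ primitive steps is discounted by $\gamma^\tau$; that convention and the function names are assumptions made for illustration, not a restatement of the cited bounds.

```python
def effective_discount(gamma: float, tau: int) -> float:
    """Discount applied per high-level decision when each decision spans tau primitive steps."""
    return gamma ** tau

def effective_horizon(gamma: float, tau: int = 1) -> float:
    """Rough number of high-level decisions that matter: 1 / (1 - gamma^tau)."""
    return 1.0 / (1.0 - effective_discount(gamma, tau))

gamma = 0.99
for tau in (1, 4, 16):
    # Longer abstractions mean fewer decisions fall inside the effective horizon.
    print(tau, effective_discount(gamma, tau), effective_horizon(gamma, tau))
```

Under this assumption, increasing $\tau$ shrinks the number of decisions the RL stage must get right, which is the intuition behind preferring action abstractions over primitive tokens.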

3. Agentic Mid-Training Methodologies and Algorithms

3.1 Latent-Option Extraction and RA3

The Reasoning as Action Abstractions (RA3) algorithm exemplifies agentic mid-training (Zhang et al., 30 Sep 2025). RA3 frames mid-training as an expectation-maximization (EM) procedure that alternates between:

  • E-Step (RL Bootstrap): Discover temporally-consistent latent variables AA6 by solving an RL problem over expert trajectories with reward AA7, where AA8 penalizes unnecessary rationale switches.
  • M-Step (Supervised Fine-tuning): Optimize AA9 via next-token prediction on bootstrapped RR0 trajectories.

This process is underpinned by the temporal ELBO:

RR1

The prior RR2 encourages persistence in high-level reasoning states.
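The effect of penalizing rationale switches can be sketched on a toy problem. The score used here (per-step log-likelihood minus a fixed cost `beta` per latent change) is a hypothetical stand-in chosen for illustration, and the brute-force search replaces the RL procedure described above; it is not the RA3 reward or algorithm.

```python
from itertools import product

def switch_count(latents):
    """Number of positions where the latent differs from the previous step."""
    return sum(1 for prev, cur in zip(latents, latents[1:]) if prev != cur)

def penalized_score(step_loglik, latents, beta):
    """step_loglik[t][z]: log-likelihood of the expert action at step t under latent z."""
    fit = sum(step_loglik[t][z] for t, z in enumerate(latents))
    return fit - beta * switch_count(latents)

def best_latents(step_loglik, num_latents, beta):
    """Exhaustive search over latent sequences; only sensible for toy sizes."""
    horizon = len(step_loglik)
    return max(
        product(range(num_latents), repeat=horizon),
        key=lambda zs: penalized_score(step_loglik, zs, beta),
    )

# Hypothetical per-step log-likelihoods for two latents over four steps.
step_loglik = [
    [-0.1, -1.0],
    [-0.6, -0.5],
    [-0.1, -1.0],
    [-1.0, -0.1],
]
low = best_latents(step_loglik, 2, beta=0.0)   # no penalty: follow the best latent per step
high = best_latents(step_loglik, 2, beta=2.0)  # strong penalty: prefer a persistent latent
print(low, high)
```

With no penalty the best sequence changes latent at every step, while a large penalty favors a single persistent latent even at some cost in fit.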

3.2 Data-Centric and RL Approaches

Empirical recipes for agentic mid-training, as in Yu et al. (13 Oct 2025), stress initializing supervised fine-tuning (SFT) with real, end-to-end agentic trajectories (e.g., tool use with verified recovery and reflection) rather than synthetic stitched data. High-diversity, model-aware RL datasets, stratified by difficulty and domain, sustain exploration and accelerate RL-driven refinement. The RL stage uses GRPO/PPO-style objectives with exploration-encouraging entropy terms and shaped rewards.
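As a rough illustration of the GRPO-style objective mentioned above, the sketch below computes group-relative advantages: each sampled response to the same prompt is scored against the mean and standard deviation of its group. The function name, the use of the population standard deviation, and the `eps` value are illustrative assumptions; implementations differ on these details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one prompt, with hypothetical 0/1 task rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

A group in which every response receives the same reward yields all-zero advantages and therefore no gradient signal, which is one reason difficulty-stratified data helps keep training informative.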

3.3 Curriculum and Massive-Scale Pipelines

Curriculum-driven setups such as Youtu-LLM shift the data distribution stage by stage: initial commonsense/STEM pre-training is followed by an agentic mid-training phase dominated by structured, multi-domain agentic trajectories (e.g., planning, code, tool use). Masking and input-formatting strategies are used to prevent noise propagation. This enables even lightweight LLMs to internalize high-level agentic schemas (Lu et al., 31 Dec 2025).
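One common way to realize such masking is to exclude non-model tokens (for example, tool or environment observations) from the training loss. The sketch below assumes a hypothetical trajectory format of `(role, tokens)` segments and a hypothetical role name `"assistant"`; the actual masking scheme used in the cited pipeline is not specified here.

```python
def build_loss_mask(segments, trainable_roles=("assistant",)):
    """Flatten (role, tokens) segments and mark which tokens contribute to the loss."""
    tokens, mask = [], []
    for role, seg_tokens in segments:
        tokens.extend(seg_tokens)
        keep = 1 if role in trainable_roles else 0
        mask.extend([keep] * len(seg_tokens))
    return tokens, mask

def masked_mean_loss(per_token_loss, mask):
    """Average the loss over unmasked tokens only."""
    kept = [loss for loss, m in zip(per_token_loss, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0

segments = [
    ("user", ["fix", "bug"]),
    ("assistant", ["run", "tests"]),
    ("tool", ["2", "failed"]),       # noisy observation: excluded from the loss
    ("assistant", ["patch", "file"]),
]
tokens, mask = build_loss_mask(segments)
loss = masked_mean_loss([9.0, 9.0, 1.0, 1.0, 9.0, 9.0, 3.0, 3.0], mask)
print(mask, loss)
```

The model still conditions on the masked tokens as context; it is simply not trained to reproduce them.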

4. Data Construction and Action Abstraction Granularity

Agentic mid-training critically depends on constructing action/trajectory spaces that balance expressivity and compactness:

  • Action abstraction: Temporally-extended options, rationales, or macro-actions enable RL over skills rather than primitives, reducing the effective planning horizon and sample complexity (Zhang et al., 30 Sep 2025).
  • Trajectory diversity: Datasets span code, mathematics, deep research, tool-use, and reflection, formatted as XML/JSON segments with structured stages (e.g., <Analysis>, <Plan>, <Action>, <Reflection>, <Summary>). Data may be generated by strong teacher models or via adversarial User/Assistant LLMs with rigorous error-checking and negative augmentation (Lu et al., 31 Dec 2025).
  • Abstraction control: Hyperparameters such as the KL penalty RR3 (or prior persistence RR4) provide a direct handle for tuning granularity—higher RR5 biases toward fewer, longer options; lower RR6 permits finer-grained reasoning (Zhang et al., 30 Sep 2025).
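The staged trajectory format described above can be checked with a small parser. This is a sketch under assumptions: the sample text is invented, and the regular expression only handles flat, non-nested stage tags with the five names listed in this section.

```python
import re

STAGES = ("Analysis", "Plan", "Action", "Reflection", "Summary")
STAGE_RE = re.compile(r"<(%s)>(.*?)</\1>" % "|".join(STAGES), re.DOTALL)

def parse_stages(text):
    """Return (stage_name, content) pairs in the order they appear."""
    return [(m.group(1), m.group(2).strip()) for m in STAGE_RE.finditer(text)]

def missing_stages(stages):
    """List the stage names that never occur in a parsed trajectory."""
    present = {name for name, _ in stages}
    return [name for name in STAGES if name not in present]

sample = """
<Analysis>The test fails on empty input.</Analysis>
<Plan>Add a guard clause, then rerun the tests.</Plan>
<Action>edit utils.py</Action>
<Reflection>Tests pass now.</Reflection>
<Summary>Fixed empty-input handling.</Summary>
"""
stages = parse_stages(sample)
print([name for name, _ in stages], missing_stages(stages))
```

A filter of this kind could be used to reject malformed samples before they enter the mid-training mix.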

5. Empirical Effects, Benchmarks, and Architectural Integration

Agentic mid-training consistently yields substantial performance and efficiency improvements across multiple agentic domains:

  • Code generation: RA3 improves average pass@1 by ≈4–8 points over NTP and base models on HumanEval, MBPP, and derived datasets; cross-entropy loss is also consistently lower (Zhang et al., 30 Sep 2025).
  • Reasoning benchmarks: With high-diversity data and calibrated RL protocols, compact models (4B) achieve >70% average@32 on AIME24/25, rivaling or exceeding 32B-parameter agents (Yu et al., 13 Oct 2025).
  • Long-context and lightweight models: Agentic mid-training (e.g., 200B tokens on Youtu-LLM) produces up to +42.7% improvement on SWE-Bench-Verified (k=1), +13.7% on APTBench, and enables strong planning, reflection, and tool-use even in sub-2B models (Lu et al., 31 Dec 2025).
  • Architectural adaptation: Dense multi-latent attention (MLA), XL-context support (128k+), and STEM-oriented tokenizers are used in conjunction with mid-training, with masking and prefix-sharing (tree training) to minimize computational overhead (Lu et al., 31 Dec 2025, Wang et al., 1 Nov 2025).

Empirical results support the sufficiency of relatively small expert datasets, provided the action space is aggressively abstracted and pruned, and the RL curriculum exploits domain-specific granularity.
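For readers unfamiliar with the metrics quoted in this section, the sketch below shows the widely used unbiased pass@k estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ are correct, together with one plausible reading of average@k as mean per-sample accuracy. The latter definition is an assumption; the cited papers may compute it differently.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n (c correct) passes."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    return 1.0 - comb(n - c, k) / comb(n, k)

def average_at_k(samples_correct):
    """Mean accuracy over the k samples of each problem, averaged over problems."""
    per_problem = [sum(flags) / len(flags) for flags in samples_correct]
    return sum(per_problem) / len(per_problem)

print(pass_at_k(n=10, c=3, k=1))
print(average_at_k([[True, False], [True, True]]))
```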

6. Integration with RL Pipelines and Infrastructure

Agentic mid-training is increasingly realized as part of large-scale distributed training architectures, exemplified by frameworks such as AWorld (Yu et al., 28 Aug 2025):

  • System design: Two-tiered orchestration with high-concurrency rollout-executors (sandboxed agent/environment pairs) and separate training clusters, enabling near-linear scaling to 16-32 parallel pods.
  • Distributed RL: Experience collection and online RL are decoupled; rollouts are streamed, and rewards (including sparse and stepwise variants) are computed asynchronously.
  • Practical workflow: Mid-training is initiated with SFT on real trajectories, followed by GRPO/PPO-based RL, with careful tuning of batch sizes, learning rates, and exploration strategies.

Such infrastructure is critical for rendering mid-training tractable at the scale required for highly agentic benchmarks (GAIA, LiveCodeBench, etc.).
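The decoupling of rollout collection from training can be sketched with a producer/consumer queue. This is a single-process toy using Python threads, not the AWorld architecture; the worker logic, rewards, and parameter values are placeholders invented for illustration.

```python
import queue
import threading

def rollout_worker(worker_id, num_episodes, out_queue):
    """Stand-in for a sandboxed agent/environment pair that streams trajectories."""
    for episode in range(num_episodes):
        trajectory = {"worker": worker_id, "episode": episode, "reward": float(episode % 2)}
        out_queue.put(trajectory)  # blocks if the trainer falls behind

def trainer(in_queue, batch_size, total, batches):
    """Consume trajectories as they arrive and group them into training batches."""
    buffer = []
    for _ in range(total):
        buffer.append(in_queue.get())
        if len(buffer) == batch_size:
            batches.append(buffer)  # a real trainer would run an RL update here
            buffer = []
    if buffer:
        batches.append(buffer)

def run(num_workers=4, episodes_per_worker=5, batch_size=8):
    q = queue.Queue(maxsize=16)
    batches = []
    total = num_workers * episodes_per_worker
    consumer = threading.Thread(target=trainer, args=(q, batch_size, total, batches))
    producers = [
        threading.Thread(target=rollout_worker, args=(i, episodes_per_worker, q))
        for i in range(num_workers)
    ]
    consumer.start()
    for p in producers:
        p.start()
    for p in producers:
        p.join()
    consumer.join()
    return batches

batches = run()
print([len(b) for b in batches])
```

The bounded queue applies back-pressure to rollout workers, which is one simple way to limit how stale the collected experience can become relative to the current policy.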

7. Practical Guidelines and Open Directions

Recommended practices for agentic mid-training include:

  • Begin with real, multi-turn agentic SFT data; avoid exclusively synthetic "stitched" trajectories (Yu et al., 13 Oct 2025).
  • Employ curriculum learning, with mid- and late-training data emphasizing agentic and high-variance trajectories.
  • Employ action abstraction and pruning to compactify the decision space; tune abstraction granularity with hyperparameters (KL penalty, RR7).
  • Integrate tree training to reuse shared prefixes for efficiency (Wang et al., 1 Nov 2025).
  • In multi-domain setups, use per-domain RL specialization, model merge (SCE), or joint RL with schedule balancing (Wang et al., 8 Nov 2025).
  • Monitor policy entropy and use exploration-enhancing PPO/GRPO schedules.
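The saving from shared-prefix reuse listed above can be estimated by counting tokens in a prefix tree. The sketch below only counts tokens and does not implement tree training itself; the trajectories and token names are invented for illustration.

```python
def count_tokens_without_sharing(trajectories):
    """Tokens processed when every trajectory is handled independently."""
    return sum(len(tokens) for tokens in trajectories)

def count_tokens_with_prefix_sharing(trajectories):
    """Unique prefix-tree nodes: each shared prefix token is processed once."""
    root = {}
    nodes = 0
    for tokens in trajectories:
        node = root
        for tok in tokens:
            if tok not in node:
                node[tok] = {}
                nodes += 1
            node = node[tok]
    return nodes

# Three hypothetical rollouts that branch after a common prefix.
trajectories = [
    ["sys", "task", "plan", "tool_a", "ok"],
    ["sys", "task", "plan", "tool_b", "fail"],
    ["sys", "task", "plan", "tool_b", "retry"],
]
flat = count_tokens_without_sharing(trajectories)
shared = count_tokens_with_prefix_sharing(trajectories)
print(flat, shared)
```

The gap between the two counts grows with the branching factor of the rollouts, which is typical of agentic data where many trajectories share a long system prompt and task description.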

Emerging directions include partial cross-GPU shared-prefix reuse, more adaptive abstraction-discovery protocols, and refined environment-aware reward modeling.

