Agentic Mid-training: Bridging Imitation and RL
- Agentic mid-training is a paradigm that bridges language modeling, imitation learning, and reinforcement learning by abstracting temporally-extended actions.
- It employs methodologies like the RA3 algorithm to alternate between RL bootstrapping and supervised fine-tuning, reducing sample complexity.
- Empirical results demonstrate substantial gains in code generation and reasoning tasks, validating the efficiency of action abstraction in agentic systems.
Agentic mid-training is a paradigm and methodology designed to bridge the gap between language modeling, imitation learning, and fully agentic reinforcement learning (RL) for LLMs and agentic systems. By abstracting high-level actions and learning temporally-extended behaviors through mid-stage training protocols, agentic mid-training enables models to acquire compact and generalizable action spaces, facilitating robust online RL and accelerating the acquisition of complex reasoning, planning, and tool-use capabilities.
1. Formal Foundations and Motivation
Agentic mid-training formalizes the transition from sequence-level imitation to action-abstraction-driven RL. Tasks are modeled as Markov Decision Processes (MDPs) $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ denotes the state space (e.g., code prefixes), $\mathcal{A}$ the atomic action space (e.g., token selections), $r$ is the reward, and $\gamma$ the discount factor. Standard next-token prediction (NTP) corresponds to imitation learning:

$$\max_{\theta}\; \mathbb{E}_{(s,a)\sim \mathcal{D}_{\text{expert}}}\big[\log \pi_\theta(a \mid s)\big].$$
Agentic mid-training departs from pure NTP by introducing a compact, temporally-extended action space $\mathcal{Z}$, where each $z \in \mathcal{Z}$ may implement subpolicies or macro-actions spanning multiple atomic steps (duration $\tau(z) \ge 1$). The objective is to select a minimal yet sufficient set of such abstractions so that subsequent online RL can operate effectively within this reduced space, facilitating policy optimization over options, rationales, or latent abstractions rather than primitive tokens (Zhang et al., 30 Sep 2025).
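To make the abstraction concrete, the following minimal sketch contrasts atomic token actions with temporally-extended options. The toy `Option` class, the code-token vocabulary, and the two-option action space are illustrative assumptions, not the construction from (Zhang et al., 30 Sep 2025):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Option:
    """A temporally-extended action z in Z: a macro-action that expands into
    several atomic token actions (duration tau(z) >= 1) before returning control."""
    name: str
    tokens: List[str]

    @property
    def duration(self) -> int:   # tau(z)
        return len(self.tokens)

# A compact abstract action space Z: two high-level decisions instead of
# fourteen token-level decisions for the same small program.
Z = [
    Option("declare_function", ["def", "add", "(", "a", ",", "b", ")", ":", "\n"]),
    Option("return_sum", ["return", "a", "+", "b", "\n"]),
]

def rollout(options: List[Option]) -> List[str]:
    """Expand a sequence of abstract actions into the atomic token trajectory."""
    tokens: List[str] = []
    for z in options:
        tokens.extend(z.tokens)
    return tokens

trajectory = rollout(Z)
print("abstract decisions:", len(Z))                # 2
print("atomic steps:", sum(z.duration for z in Z))  # 14
print("tokens:", trajectory)
```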
2. Theoretical Characterization of Performance
The efficacy of agentic mid-training is characterized by its impact on the regret decomposition following post-training RL:

$$V^{*}(s_0) - V^{\hat{\pi}}(s_0) \;\le\; \underbrace{V^{*}(s_0) - V^{*}_{\mathcal{Z}}(s_0)}_{\text{pruning error}} \;+\; \underbrace{V^{*}_{\mathcal{Z}}(s_0) - V^{\hat{\pi}}(s_0)}_{\text{RL error}},$$

where the first term quantifies the value-approximation (pruning) error induced by restricting actions to $\mathcal{Z}$, and the second term is the RL error within the pruned space.
Key formal results:
- The minimal $\epsilon$-optimal abstract action set $\mathcal{Z}_\epsilon$ controls sample complexity: with sufficiently many expert demonstrations, all $\epsilon$-suboptimal actions are pruned with probability at least $1-\delta$, yielding a pruning error of at most $\epsilon$.
- RL convergence in the abstracted MDP is governed by the effective discount $\gamma^{\bar{\tau}}$, where $\bar{\tau}$ is the average abstraction duration, which yields a faster contraction rate and correspondingly lower sample complexity (Zhang et al., 30 Sep 2025).
These results demonstrate that pruning efficiency (a small $|\mathcal{Z}_\epsilon|$) and horizon shortening via longer temporal abstractions ($\bar{\tau} > 1$) synergistically reduce the required expert data and RL updates.
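The horizon-shortening effect can be made concrete with a small calculation; the discount and duration values below are illustrative, not figures from the paper:

```python
# Effective horizon under temporal abstraction: an option of average duration
# tau counts as one decision in the abstract MDP, so the effective discount is
# gamma_eff = gamma ** tau and the effective horizon 1 / (1 - gamma_eff) shrinks.
gamma = 0.99  # atomic-level discount (illustrative value)
for tau in (1, 4, 16):
    gamma_eff = gamma ** tau
    horizon = 1.0 / (1.0 - gamma_eff)
    print(f"tau={tau:>2}  gamma_eff={gamma_eff:.3f}  effective horizon ~ {horizon:6.1f}")
# tau= 1 -> ~100.0 decisions; tau= 4 -> ~25.4; tau=16 -> ~6.7
```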
3. Agentic Mid-Training Methodologies and Algorithms
3.1 Latent-Option Extraction and RA3
The Reasoning as Action Abstractions (RA3) algorithm exemplifies agentic mid-training (Zhang et al., 30 Sep 2025). RA3 frames mid-training as an EM algorithm alternating between:
- E-Step (RL Bootstrap): Discover temporally-consistent latent variables $z_{1:T}$ by solving an RL problem over expert trajectories whose reward combines trajectory log-likelihood with a switching penalty that discourages unnecessary rationale switches.
- M-Step (Supervised Fine-tuning): Optimize the policy $\pi_\theta$ via next-token prediction on the bootstrapped trajectories.
This process is underpinned by a temporal ELBO of the form:

$$\log p_\theta(a_{1:T} \mid s_1) \;\ge\; \mathbb{E}_{q(z_{1:T})}\!\Big[\sum_{t=1}^{T} \log p_\theta(a_t \mid s_t, z_t)\Big] - \mathrm{KL}\!\big(q(z_{1:T}) \,\|\, p(z_{1:T})\big),$$

where the prior $p(z_{1:T}) = \prod_{t} p(z_t \mid z_{t-1})$ encourages persistence in high-level reasoning states.
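The alternation can be sketched schematically as follows. This is not the RA3 implementation: the Viterbi-style segmentation, the switch penalty `beta`, and the stubbed `m_step` are stand-ins, under the assumption that `score(step, z)` plays the role of the policy's log-likelihood of the expert action under latent option `z`:

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (state, expert action) pair from an expert trajectory

def e_step(traj: List[Step],
           score: Callable[[Step, int], float],
           num_options: int,
           beta: float) -> List[int]:
    """E-step (RL bootstrap, schematic): assign a latent option z_t to each step,
    maximizing sum_t score(step_t, z_t) - beta * 1[z_t != z_{t-1}] with a
    Viterbi-style dynamic program, so the latents stay temporally consistent."""
    T = len(traj)
    dp = [[float("-inf")] * num_options for _ in range(T)]
    back = [[0] * num_options for _ in range(T)]
    for z in range(num_options):
        dp[0][z] = score(traj[0], z)
    for t in range(1, T):
        for z in range(num_options):
            prev = max(range(num_options),
                       key=lambda zp: dp[t - 1][zp] - (beta if zp != z else 0.0))
            dp[t][z] = score(traj[t], z) + dp[t - 1][prev] - (beta if prev != z else 0.0)
            back[t][z] = prev
    # Backtrack the best temporally-consistent option sequence.
    z_last = max(range(num_options), key=lambda z: dp[T - 1][z])
    latents = [z_last]
    for t in range(T - 1, 0, -1):
        latents.append(back[t][latents[-1]])
    return list(reversed(latents))

def m_step(traj: List[Step], latents: List[int]) -> None:
    """M-step (schematic): next-token-prediction fine-tuning on the bootstrapped
    (state, latent, action) triples; a real system would apply gradient updates here."""
    for (state, action), z in zip(traj, latents):
        pass  # placeholder for an SFT update on (state, z, action)

def ra3_style_mid_training(trajs: List[List[Step]],
                           score: Callable[[Step, int], float],
                           num_options: int = 4,
                           beta: float = 1.0,
                           iters: int = 3) -> None:
    """Alternate E- and M-steps over the expert corpus."""
    for _ in range(iters):
        for traj in trajs:
            m_step(traj, e_step(traj, score, num_options, beta))
```

Increasing `beta` makes switches costlier and therefore yields fewer, longer options, which is exactly the granularity handle discussed in Section 4.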
3.2 Data-Centric and RL Approaches
Empirical recipes for agentic mid-training, as in (Yu et al., 13 Oct 2025), stress initializing with supervised fine-tuning (SFT) on real, end-to-end agentic trajectories (e.g., tool use with verified recovery and reflection) rather than synthetically stitched data. High-diversity, model-aware RL datasets, stratified by difficulty and domain, sustain exploration and accelerate RL-driven refinement using GRPO/PPO-style objectives with exploration-encouraging entropy bonuses and shaped rewards.
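A minimal sketch of the group-relative (GRPO-style) objective with an entropy bonus follows; the clipping constant, entropy coefficient, and toy rewards are illustrative assumptions rather than a recipe from (Yu et al., 13 Oct 2025):

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: standardize shaped rewards across the G
    rollouts sampled for the same prompt (shape [G])."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

def policy_loss(logprobs_new: torch.Tensor,
                logprobs_old: torch.Tensor,
                advantages: torch.Tensor,
                entropy: torch.Tensor,
                clip_eps: float = 0.2,
                ent_coef: float = 0.01) -> torch.Tensor:
    """PPO-style clipped surrogate plus an exploration-encouraging entropy bonus.
    All tensors are per-rollout (shape [G]); token-level aggregation is omitted."""
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -(torch.min(unclipped, clipped).mean() + ent_coef * entropy.mean())

# Toy usage: four rollouts of one prompt with shaped scalar rewards.
rewards = torch.tensor([1.0, 0.2, 0.0, 0.6])
adv = grpo_advantages(rewards)
loss = policy_loss(torch.randn(4), torch.randn(4), adv, entropy=torch.rand(4))
print(float(loss))
```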
3.3 Curriculum and Massive-Scale Pipelines
In curriculum-driven setups such as Youtu-LLM, stagewise shifts in the data distribution begin with commonsense/STEM pre-training, followed by an agentic mid-training phase dominated by structured, multi-domain agentic trajectories (e.g., planning, code, tool use), with masking and input-formatting strategies applied to prevent noise propagation. This enables even lightweight LLMs to internalize high-level agentic schemas (Lu et al., 31 Dec 2025).
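One way to realize such a masking strategy is to exclude user and tool/environment segments from the next-token loss, so only the model's own reasoning and actions are trained on. The sketch below assumes role-tagged token segments and the conventional -100 ignore index; the role names are hypothetical, not Youtu-LLM specifics:

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # conventional "ignore" label for cross-entropy losses

def build_labels(segments: List[Tuple[str, List[int]]]) -> Tuple[List[int], List[int]]:
    """Concatenate role-tagged token segments into (input_ids, labels), masking
    tool/environment outputs so they do not propagate noise into the NTP loss."""
    trainable_roles = {"assistant", "plan", "action", "reflection"}
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role in trainable_roles:
            labels.extend(token_ids)                        # learn these tokens
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))  # mask user / tool output
    return input_ids, labels

# Toy example: [user prompt][assistant plan][tool output][assistant reflection]
segments = [("user", [11, 12]), ("plan", [21, 22, 23]),
            ("tool", [31, 32, 33, 34]), ("reflection", [41, 42])]
ids, labels = build_labels(segments)
assert labels == [-100, -100, 21, 22, 23, -100, -100, -100, -100, 41, 42]
```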
4. Data Construction and Action Abstraction Granularity
Agentic mid-training critically depends on constructing action/trajectory spaces that balance expressivity and compactness:
- Action abstraction: Temporally-extended options, rationales, or macro-actions enable RL over skills rather than primitives, reducing the effective planning horizon and sample complexity (Zhang et al., 30 Sep 2025).
- Trajectory diversity: Datasets span code, mathematics, deep research, tool use, and reflection, formatted as XML/JSON segments with structured stages (e.g., <Analysis>, <Plan>, <Action>, <Reflection>, <Summary>); a format sketch follows this list. Data may be generated by strong teacher models or via adversarial User/Assistant LLMs with rigorous error-checking and negative augmentation (Lu et al., 31 Dec 2025).
- Abstraction control: Hyperparameters such as the KL penalty (or the prior's persistence strength) provide a direct handle for tuning granularity: a higher penalty biases toward fewer, longer options, while a lower one permits finer-grained reasoning (Zhang et al., 30 Sep 2025).
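Below is a sketch of what a structured trajectory segment and a simple structural check might look like; the tag set follows the list above, while the example content and the validator itself are illustrative assumptions:

```python
import re
from typing import List

# Required stages for a well-formed agentic trajectory segment (tag set from above).
STAGES = ["Analysis", "Plan", "Action", "Reflection", "Summary"]

EXAMPLE_SEGMENT = """\
<Analysis>The test fails because `parse()` drops trailing whitespace.</Analysis>
<Plan>1. Reproduce locally. 2. Patch the tokenizer. 3. Re-run the test suite.</Plan>
<Action>run_tool("pytest", args=["tests/test_parse.py"])</Action>
<Reflection>The failure is reproducible; the fix should strip input before parsing.</Reflection>
<Summary>Identified the root cause and planned a one-line tokenizer patch.</Summary>
"""

def validate_segment(text: str) -> List[str]:
    """Return the stages missing or empty in a trajectory segment; an empty
    list means the segment passes this structural error check."""
    missing = []
    for stage in STAGES:
        if not re.search(rf"<{stage}>.+?</{stage}>", text, flags=re.DOTALL):
            missing.append(stage)
    return missing

assert validate_segment(EXAMPLE_SEGMENT) == []
```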
5. Empirical Effects, Benchmarks, and Architectural Integration
Agentic mid-training consistently yields substantial performance and efficiency improvements across multiple agentic domains:
- Code generation: RA3 improves average pass@1 by ≈4–8 points over NTP and base models on HumanEval, MBPP, and derived datasets; cross-entropy loss is also consistently lower (Zhang et al., 30 Sep 2025).
- Reasoning benchmarks: With high-diversity data and calibrated RL protocols, compact models (4B) achieve >70% average@32 on AIME24/25, rivaling or exceeding 32B-parameter agents (Yu et al., 13 Oct 2025).
- Long-context and lightweight models: Agentic mid-training (e.g., 200B tokens on Youtu-LLM) produces up to +42.7% improvement on SWE-Bench-Verified (k=1), +13.7% on APTBench, and enables strong planning, reflection, and tool-use even in sub-2B models (Lu et al., 31 Dec 2025).
- Architectural adaptation: Dense models with multi-head latent attention (MLA), extended-context support (128k+), and STEM-oriented tokenizers are used in conjunction with mid-training, with masking and prefix-sharing (tree training) to minimize computational overhead (Lu et al., 31 Dec 2025, Wang et al., 1 Nov 2025).
Empirical results support the sufficiency of relatively small expert datasets, provided the action space is aggressively abstracted and pruned, and the RL curriculum exploits domain-specific granularity.
6. Integration with RL Pipelines and Infrastructure
Agentic mid-training is increasingly realized as part of large-scale distributed training architectures, exemplified by frameworks such as AWorld (Yu et al., 28 Aug 2025):
- System design: Two-tiered orchestration with high-concurrency rollout-executors (sandboxed agent/environment pairs) and separate training clusters, enabling near-linear scaling to 16-32 parallel pods.
- Distributed RL: Experience collection and online RL are decoupled; rollouts are streamed, and rewards (including sparse and stepwise variants) are computed asynchronously.
- Practical workflow: Mid-training is initiated with SFT on real trajectories, followed by GRPO/PPO-based RL, with careful tuning of batch sizes, learning rates, and exploration strategies.
Such infrastructure is critical for rendering mid-training tractable at the scale required for highly agentic benchmarks (GAIA, LiveCodeBench, etc.).
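The decoupling of experience collection from policy updates can be illustrated with a minimal producer/consumer sketch; the queue-based threading design and the random reward stub are assumptions for illustration, not the AWorld architecture itself:

```python
import queue
import random
import threading
import time

rollout_queue: "queue.Queue[dict]" = queue.Queue(maxsize=64)

def rollout_executor(executor_id: int, num_episodes: int) -> None:
    """Sandboxed agent/environment pair: runs episodes and streams
    (trajectory, reward) records to the trainer asynchronously."""
    for ep in range(num_episodes):
        trajectory = [f"step-{executor_id}-{ep}-{t}" for t in range(random.randint(2, 5))]
        reward = random.random()   # sparse or stepwise reward, computed off the training path
        rollout_queue.put({"traj": trajectory, "reward": reward})
        time.sleep(0.01)           # stands in for environment latency

def trainer(total_records: int, batch_size: int = 4) -> None:
    """Training cluster: consumes streamed rollouts and applies RL updates
    (the GRPO/PPO update itself is omitted; only the batching loop is shown)."""
    batch, seen = [], 0
    while seen < total_records:
        batch.append(rollout_queue.get())
        seen += 1
        if len(batch) == batch_size:
            mean_r = sum(r["reward"] for r in batch) / len(batch)
            print(f"update on {len(batch)} rollouts, mean reward {mean_r:.2f}")
            batch = []

if __name__ == "__main__":
    executors = [threading.Thread(target=rollout_executor, args=(i, 8)) for i in range(4)]
    for t in executors:
        t.start()
    trainer(total_records=32)
    for t in executors:
        t.join()
```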
7. Practical Guidelines and Open Directions
Standardized best practices for agentic mid-training include:
- Begin with real, multi-turn agentic SFT data; avoid exclusively synthetic "stitched" trajectories (Yu et al., 13 Oct 2025).
- Employ curriculum learning, with mid- and late-training data emphasizing agentic and high-variance trajectories.
- Employ action abstraction and pruning to compactify the decision space; tune abstraction granularity with hyperparameters (KL penalty, prior persistence).
- Integrate tree training to reuse shared prefixes for efficiency (Wang et al., 1 Nov 2025); a prefix-reuse sketch follows this list.
- In multi-domain setups, use per-domain RL specialization, model merge (SCE), or joint RL with schedule balancing (Wang et al., 8 Nov 2025).
- Monitor policy entropy and use exploration-enhancing PPO/GRPO schedules.
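The prefix-reuse idea behind tree training can be sketched by grouping trajectories that share a common prefix so shared tokens are processed once; the trie below illustrates the accounting and is not the implementation from (Wang et al., 1 Nov 2025):

```python
from typing import Dict, List, Tuple

class PrefixTrie:
    """Groups token sequences by shared prefixes; each trie node's token is
    processed once instead of once per trajectory that contains it."""
    def __init__(self) -> None:
        self.children: Dict[int, "PrefixTrie"] = {}

    def insert(self, tokens: List[int]) -> None:
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixTrie())

    def count_nodes(self) -> int:
        return sum(1 + child.count_nodes() for child in self.children.values())

def token_counts(trajectories: List[List[int]]) -> Tuple[int, int]:
    """(tokens processed without sharing, tokens processed with prefix sharing)."""
    trie = PrefixTrie()
    for traj in trajectories:
        trie.insert(traj)
    return sum(len(t) for t in trajectories), trie.count_nodes()

# Three rollouts of the same prompt share their first four tokens.
rollouts = [[1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 7], [1, 2, 3, 4, 8, 9, 10]]
flat, shared = token_counts(rollouts)
assert (flat, shared) == (18, 10)  # prefix sharing cuts 18 token passes to 10
```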
Emerging directions include partial cross-GPU shared-prefix reuse, more adaptive abstraction-discovery protocols, and refined environment-aware reward modeling.
Key References:
- "Learning to Reason as Action Abstractions with Scalable Mid-Training RL" (Zhang et al., 30 Sep 2025)
- "Demystifying Reinforcement Learning in Agentic Reasoning" (Yu et al., 13 Oct 2025)
- "AWorld: Orchestrating the Training Recipe for Agentic AI" (Yu et al., 28 Aug 2025)
- "Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight LLMs" (Lu et al., 31 Dec 2025)
- "Tree Training: Accelerating Agentic LLMs Training via Shared Prefix Reuse" (Wang et al., 1 Nov 2025)
- "Agent Fine-tuning through Distillation for Domain-specific LLMs in Microdomains" (Xue et al., 1 Oct 2025)
- "Klear-AgentForge: Forging Agentic Intelligence through Posttraining Scaling" (Wang et al., 8 Nov 2025)