Agentic CPT: Empowering Autonomous LLMs
- Agentic CPT is a training paradigm that infuses LLMs with autonomous planning, tool use, adaptive reasoning, and multi-turn task execution during mid-training.
- It employs synthetic data pipelines—like first-order and higher-order action synthesis—to create diverse agentic trajectories that optimize behavior before alignment.
- Empirical results demonstrate significant accuracy boosts on agentic benchmarks, enabling robust emergence of agentic skills in models above critical parameter thresholds.
Agentic Mid-Training (Agentic CPT) is a specialized training regime for LLMs that aims to imbue models with agentic capabilities—such as autonomous planning, tool use, adaptive reasoning, and multi-turn task execution—by incorporating agent-centric data and optimization objectives into the mid- or continued pre-training stage, typically prior to instruction fine-tuning or reinforcement learning. This paradigm is motivated by the observation that standard pre-training benchmarks and curricula emphasize static, single-turn skills that fail to predict or elicit agentic proficiency, while post-training alone cannot resolve the optimization bottlenecks caused by the absence of upstream agentic priors.
1. Concept and Motivation
Agentic mid-training (Agentic CPT) refers to curriculum stages in which an LLM is exposed—before or interleaved with downstream alignment phases—to data, objectives, and benchmarks specifically designed to foster the agentic potential critical to real-world autonomy. In contrast to general pre-training (focused on completion of static text) or standard supervised fine-tuning/reinforcement learning (focused on alignment to gold demonstrations), agentic CPT seeks to produce base models that already possess planning, sequential action execution, and context adaptation abilities. This decoupling alleviates the optimization tension that arises when post-training alone must teach both new behaviors and alignment simultaneously, which tends to yield suboptimal solutions or brittle overfitting (Su et al., 16 Sep 2025).
APTBench (Qin et al., 28 Oct 2025) empirically demonstrates that general pre-training benchmarks (MMLU, GSM8K, EvalPlus) correlate weakly or negatively with downstream agentic benchmark performance, whereas agentic-specific mid-training metrics (e.g., APTBench-SWE, APTBench-DR) are strong predictors. Agentic ability only emerges above critical model sizes (generally >4B params), necessitating both architectural capacity and aligned data.
2. Benchmarking and Measurement
APTBench (Qin et al., 28 Oct 2025) introduces a conversion pipeline to make agent tasks accessible to base models lacking mature instruction-following capabilities:
- Task & Trajectory Collection: Aggregate successful human/agent trajectories in domains such as software engineering (issue fixing, environment setup) and research (open/closed QA).
- Question Formulation: Decompose multi-turn tasks into multiple-choice or text-completion queries isolating agentic abilities—planning (task decomposition, dynamic adjustment), action (tool use commands), and atomic scenario-specific skills (bug localization, citation generation).
- Distractor Generation: Employ LLMs to degrade correct actions, shuffle steps, or introduce logic errors for MCQ distractors; require exact-match completion for tool invocations.
- Human Validation: All data is manually checked post-generation.
APTBench provides coverage over both standard-length and long-context problems (>16k tokens), with accuracy (ACC), exact match (EM), and ROUGE as metrics, probing the model's capacity for long-range reasoning and scenario fidelity.
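As a concrete illustration of the question-formulation and distractor steps, the sketch below converts one step of a successful trajectory into a multiple-choice item and scores tool-invocation completions by exact match. The `llm.complete(prompt)` interface, the prompt wording, and the dataclass fields are assumptions for illustration, not the released APTBench tooling.

```python
import random
from dataclasses import dataclass

@dataclass
class MCQItem:
    context: str        # trajectory prefix shown to the base model
    question: str       # e.g., "Which action should the agent take next?"
    options: list[str]  # one gold action plus LLM-degraded distractors
    answer_idx: int     # index of the gold action after shuffling

def make_mcq(trajectory_prefix: str, gold_action: str, llm, n_distractors: int = 3) -> MCQItem:
    """Turn one step of a successful agent trajectory into a multiple-choice query."""
    # Ask an LLM to corrupt the gold action: wrong tool, shuffled steps, or a logic error.
    distractors = [
        llm.complete(
            "Rewrite this agent action so it looks plausible but is subtly wrong "
            f"(wrong file, wrong flag, skipped prerequisite):\n{gold_action}"
        )
        for _ in range(n_distractors)
    ]
    options = distractors + [gold_action]
    random.shuffle(options)
    return MCQItem(
        context=trajectory_prefix,
        question="Which action should the agent take next?",
        options=options,
        answer_idx=options.index(gold_action),
    )

def score_tool_completion(predicted: str, gold: str) -> bool:
    """Text-completion items (e.g., tool invocations) are scored by exact match."""
    return predicted.strip() == gold.strip()
```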
| Domain | Task | Assessed Abilities | Format |
|---|---|---|---|
| Software Engineering | EnvSetup, IssueFix | Planning, Action, Atomic | MCQ, TC |
| Deep Research | Closed/Open QA | Planning, Action, Atomic | MCQ, TC |
APTBench’s main finding is that its scores are highly correlated with instruct-model performance on end-to-end agent benchmarks such as SWE-bench Verified, whereas general-purpose benchmarks are not (Qin et al., 28 Oct 2025).
3. Curriculum and Data Synthesis Strategies
Agentic CPT leverages synthetic data pipelines to produce both breadth and depth in agentic behavior:
- First-order Action Synthesis (FAS): Generate (question, plan, action) tuples capturing initial steps, scenario anchoring, and diverse problem instantiations, without requiring API-based tool calls (Su et al., 16 Sep 2025); a schematic FAS/HAS sketch follows this list.
- Higher-order Action Synthesis (HAS): Synthesize full multi-step trajectories, including alternative reasoning and action branches at each step, contrastive selection, and ground-truth actions, enabling decision-level discrimination.
- Agentic Policy Decomposition: For complex policy settings (e.g., business process agents), category-aware parsing and simulation data are used to pre-train models on factual, behavioral, and complex conditional rules, greatly reducing the reliance on prompt-length and direct in-context provisioning (Liu et al., 13 Oct 2025).
- Automated Filtering: LLM-judge or rubric-based rejection sampling is consistently applied to maintain data quality and semantic agreement with known ground truth.
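A schematic sketch of the FAS/HAS idea referenced above is given below, assuming hypothetical `generator` and `judge` clients with a `complete(prompt)` method; the prompts, dataclasses, branching factor, and the naive judge-index parse are illustrative placeholders rather than the published pipeline.

```python
from dataclasses import dataclass

@dataclass
class FASSample:
    question: str      # scenario-anchored task
    plan: str          # drafted plan
    first_action: str  # first step only; no live tool/API call needed

@dataclass
class HASStep:
    candidates: list[str]  # alternative action branches proposed at this step
    chosen: str            # selected (ground-truth-style) action
    rationale: str         # contrastive judgment over the candidates

def synthesize_fas(seed_doc: str, generator) -> FASSample:
    """First-order Action Synthesis: anchor a task in a document, then draft plan + first action."""
    question = generator.complete(f"Write an agentic task grounded in this document:\n{seed_doc}")
    plan = generator.complete(f"Outline a step-by-step plan for: {question}")
    action = generator.complete(f"Given the plan:\n{plan}\nWrite the first concrete action.")
    return FASSample(question, plan, action)

def synthesize_has(sample: FASSample, generator, judge, max_steps: int = 6, branches: int = 3) -> list[HASStep]:
    """Higher-order Action Synthesis: expand into a multi-step trajectory with branches per step."""
    steps, state = [], f"{sample.plan}\n{sample.first_action}"
    for _ in range(max_steps):
        candidates = [generator.complete(f"Propose the next action after:\n{state}") for _ in range(branches)]
        # LLM-judge rejection sampling: keep the branch the judge endorses (naive index parse).
        verdict = judge.complete(
            "Pick the index of the best next action and justify briefly:\n"
            + "\n".join(f"{i}: {c}" for i, c in enumerate(candidates))
        )
        chosen = candidates[int(verdict.strip()[0])]
        steps.append(HASStep(candidates, chosen, verdict))
        state += f"\n{chosen}"
    return steps
```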
The CPT phase typically uses the standard autoregressive loss over the synthesized agentic data:
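$$
\mathcal{L}_{\text{CPT}}(\theta) \;=\; -\,\mathbb{E}_{x \sim \mathcal{D}_{\text{agentic}}}\left[\sum_{t=1}^{|x|} \log p_\theta\!\left(x_t \mid x_{<t}\right)\right],
$$

where $\mathcal{D}_{\text{agentic}}$ denotes the synthesized agentic corpus (FAS/HAS samples serialized as token sequences $x$) and $p_\theta$ is the model's next-token distribution.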
Key to successful agentic CPT is diversity—across domains, action styles, and trajectory structure—to avoid overfitting to a narrow class of agentic phenomena. For policy document internalization, targeted scenario-simulation and chain-of-thought data are synthesized per category, yielding up to a 97.3% prompt-length reduction with improved generalization and robustness (Liu et al., 13 Oct 2025).
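As a rough illustration of category-aware policy decomposition, the sketch below turns parsed rules into scenario-simulation samples for CPT instead of re-sending the policy in every prompt; the rule categories, templates, and `generator.complete` interface are assumed for illustration.

```python
from typing import Iterable

# Illustrative rule categories (factual, behavioral, complex conditional), each with
# its own scenario-simulation / chain-of-thought template.
CATEGORY_TEMPLATES = {
    "factual": "State the policy fact that applies to: {scenario}",
    "behavioral": "Describe how the agent should behave when: {scenario}",
    "conditional": "Walk through the condition checks, step by step, for: {scenario}",
}

def synthesize_policy_cpt_data(rules: Iterable[dict], generator, n_scenarios: int = 4) -> list[dict]:
    """Convert parsed policy rules into per-category CPT samples so the policy is
    internalized in the weights rather than carried in the prompt."""
    samples = []
    for rule in rules:  # each rule: {"category": "factual" | "behavioral" | "conditional", "text": ...}
        template = CATEGORY_TEMPLATES[rule["category"]]
        for _ in range(n_scenarios):
            scenario = generator.complete(f"Invent a concrete user scenario governed by:\n{rule['text']}")
            target = generator.complete(template.format(scenario=scenario) + f"\nPolicy: {rule['text']}")
            samples.append({"prompt": scenario, "completion": target})
    return samples
```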
4. Infrastructure, Optimization, and Scaling
Agentic mid-training requires specialized system architectures. Key system-level advances include:
- Parallelization and Resource Management (Tan et al., 7 Oct 2025):
- Exploding context lengths in agentic RL training (multi-turn, tool-use tasks) necessitate dynamic parallelism: parallelism selectors adapt tensor/model sharding depth to maintain throughput and prevent OOM failures as context grows (e.g., >32K tokens); a toy selector heuristic is sketched after this list.
- Decentralized data dispatchers (layout-aware, all-to-all) supplant centralized data gathering, eliminating bottlenecks from massive intermediate tensor exchanges.
- Impact: markedly reduced communication latency at long-context scales.
- Distributed Rollout Systems (Yu et al., 28 Aug 2025):
- Rollout acceleration via distributed agent-environment pods (Kubernetes orchestration), achieving a 14.6× speedup over single-node sequential execution and making complex RL-based mid-training feasible for very large agents.
- State and trace management systems facilitate high-throughput, failure-tolerant execution, enabling curriculum learning across diverse environments and continuous audit of behaviors.
- RL Algorithmic Choices (Yu et al., 13 Oct 2025):
- Group Relative Policy Optimization (GRPO), often with extensions such as RoC (Resample-on-Correct), is used for credit assignment in high-variance, tool-use-heavy environments.
- Key practices include “clip higher” (permissive upward policy updates for enhanced exploration), reward shaping for output-length and tool-use efficiency, and token-level advantage aggregation to match exploration capacity to model scale; a minimal GRPO-style sketch follows this list.
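To picture the parallelism-selector idea from the first bullet, the toy heuristic below picks the smallest tensor-parallel degree whose estimated activation footprint still fits in GPU memory as the rollout context grows; the memory estimate and constants are illustrative assumptions, not the system's actual cost model.

```python
def select_tp_degree(context_len: int, hidden: int, layers: int, gpu_mem_gb: float,
                     candidates: tuple[int, ...] = (1, 2, 4, 8)) -> int:
    """Pick the smallest tensor-parallel degree that avoids OOM under a toy activation model;
    smaller degrees keep communication overhead low, larger ones shard memory further."""
    for tp in candidates:
        # Toy estimate: bf16 activations scale with context_len * hidden * layers,
        # sharded across tp devices (the factor 4 is an arbitrary fudge for buffers).
        est_gb = 2 * context_len * hidden * layers * 4 / tp / 1e9
        if est_gb < 0.8 * gpu_mem_gb:  # leave headroom for weights and KV cache
            return tp
    return candidates[-1]              # fall back to maximum sharding

# Example: re-select sharding once a multi-turn rollout grows past ~32K tokens.
print(select_tp_degree(context_len=40_000, hidden=8192, layers=32, gpu_mem_gb=80))  # -> 2
```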
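The next sketch shows a minimal GRPO-style loss with asymmetric (“clip higher”) clipping and token-level advantage aggregation; the epsilon values, tensor layout, and the omission of reward shaping and RoC resampling are simplifying assumptions rather than the cited recipes.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, eps_low=0.2, eps_high=0.28):
    """Minimal GRPO-style objective for one group of G rollouts from the same prompt.

    logp_new / logp_old: lists of per-token log-prob tensors, one tensor per rollout.
    rewards: shape (G,) tensor with one scalar reward per rollout.
    eps_high > eps_low implements 'clip higher': more room to move probability
    mass upward, which encourages exploration.
    """
    # Group-relative advantage: normalize rewards within the rollout group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    per_token_losses = []
    for lp_new, lp_old, a in zip(logp_new, logp_old, adv):
        ratio = torch.exp(lp_new - lp_old)                        # per-token importance ratio
        unclipped = ratio * a
        clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * a
        per_token_losses.append(-torch.minimum(unclipped, clipped))

    # Token-level aggregation: average over all tokens in the group, so long
    # trajectories are not implicitly down-weighted relative to short ones.
    return torch.cat(per_token_losses).mean()
```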
5. Empirical Results and Emergence Phenomena
Agentic CPT consistently yields:
- Correlation with Real-World Agent Tasks: APTBench (Qin et al., 28 Oct 2025) and similar mid-training evaluations predict downstream agentic benchmark performance far better than static skill metrics do.
- Agentic Ability Thresholds: Emergence of agentic behavior occurs only in models above threshold parameter count (e.g., >4B), with failure in smaller models even when static benchmarks are passed.
- Performance Gains: Models with agentic CPT show 5–14 percentage point accuracy gains on agentic benchmarks over fine-tuning from plain pre-trained checkpoints (Su et al., 16 Sep 2025). In policy-internalization scenarios, CPT plus minimal SFT outperforms SFT alone by 22–41% on high-complexity policies while shrinking prompt length by up to 97% (Liu et al., 13 Oct 2025). RL-based CPT recipes enable small models (~4B parameters) to rival or outperform models 8× larger (Yu et al., 13 Oct 2025; Shang et al., 28 Aug 2025).
- Discriminative Capability: Agentic benchmarks reveal granular architectural and data scaling effects not visible in generic evaluation, allowing precise measurement of agentic emergence and alignment.
| Model | APTBench Corr. | Agentic SFT Gain | Policy Prompt Reduction | Small-Model Superiority |
|---|---|---|---|---|
| Qwen/AgentFounder | High | +5–14 pp | Up to 97% (policy) | Yes (with proper CPT/RL) |
| DeepSeek-R1, rStar2 | – | – | – | 4B matches 32B with RL |
6. Implications and Future Directions
Agentic CPT is now established as a "scaling layer" that bridges base model capacity and post-training alignment for agentic LLMs. Its methodological innovations:
- Enable the emergence and measurement of agentic skills before alignment, decoupling capability learning from SFT/RL, and resolving optimization tension (Su et al., 16 Sep 2025, Qin et al., 28 Oct 2025).
- Allow for scalable, targeted, and lightweight evaluation and curriculum design, avoiding brittle dependence on instruction-following frameworks or massive prompt context.
- Provide a foundation for robust, general-purpose agentic foundation models in the open-source community, previously unattainable with post-training alone.
- Open pathways for new alignment strategies: modular policy specification internalization (Liu et al., 13 Oct 2025), secondary RL or deliberate planning RL (Yu et al., 13 Oct 2025), and systematic infrastructure improvement (Tan et al., 7 Oct 2025).
Future work is focused on deeper agent-environment co-evolution (environment scaling), dynamic curriculum construction, and efficient scaling to ultra-long context and multi-agent scenarios. Agentic CPT is considered critical for the next generation of robust, adaptive, and general agentic LLMs.
7. Comparative Summary Table
| Dimension | General Pre-training | Agentic CPT (Mid-training) | Standard SFT/RL Post-training |
|---|---|---|---|
| Target Task | Static, single-turn | Agentic, multi-turn, tool use | Alignment or imitation |
| Benchmark Corr. | Weak | High (w.r.t. agentic tasks) | Variable |
| Data | Web/corpus | Synthetic, scenario/trajectory | Human/LLM demo, reward |
| Compute Cost | High | Medium (short evals, lightweight) | High on real rollouts |
| Emergence | Delayed | Observed at smaller scale | Model-size/recipe-bound |
| Alignment Tension | Severe | Removed/decoupled | Strong |
Agentic Mid-Training (Agentic CPT) thus constitutes both a methodology and an accompanying set of evaluation standards for developing LLMs genuinely capable of goal-conditional, autonomous behavior, as validated by both empirical gains and the emergence of robust, generalizable agentic skills.