
Agentic CPT: Pre-training for Autonomous Agents

Updated 17 September 2025
  • Agentic CPT is a continual pre-training paradigm that embeds agent behaviors like planning, tool use, and self-improvement into models.
  • It leverages synthetic and experience-generated agent trajectories to instill long-context, multi-step reasoning and to improve alignment efficiency.
  • The approach employs tiered data synthesis and architectural tweaks, such as MoE and extended context windows, to mitigate optimization conflicts.

Agentic Continual Pre-training (Agentic CPT) is a paradigm and associated methodological framework that integrates agent-like behaviors, such as multi-step reasoning, adaptive tool use, autonomous action planning, and self-improvement, directly into a language or decision model’s continual pre-training cycle. Agentic CPT differs from conventional continual pre-training in that it explicitly endows large models with agentic priors, using either synthetic or experience-generated data streams, before any supervised fine-tuning (SFT) or reinforcement learning (RL) for alignment. This foundation is designed to resolve the optimization conflicts inherent in post-hoc alignment of non-agentic models, enabling more robust, adaptable, and generalizable agentic systems suited to complex, dynamic, real-world environments.

1. Core Definition and Motivation

Agentic CPT is introduced as an intermediate pre-training stage positioned between initial foundation model pre-training and downstream alignment steps (e.g., supervised fine-tuning, RL) (Su et al., 16 Sep 2025). Its objective is to pre-align models on simulated or synthesized agentic data reflecting behaviors such as decision trace planning, external tool invocation, and long-horizon multi-step reasoning. Critically, Agentic CPT addresses fundamental optimization tensions in post-training-based agentic pipelines: rather than forcing models to learn both agentic competence and expert alignment from sparse downstream demonstrations, CPT leverages large-scale, curated agentic experiences to “bias” the model toward robust agentic function prior to SFT or RL.

The methodology typically involves:

  • Leveraging large volumes of synthetic or experience-based agent trajectory data, constructed via First-order Action Synthesis (FAS) or Higher-order Action Synthesis (HAS).
  • Adopting long-context windows (e.g., 32K–128K tokens) to encode full agentic workflows.
  • Training with a next-token prediction objective over these agent trajectories, so that agentic priors are embedded directly in the foundation model’s parametric space (a serialization sketch follows below).
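
To make the last point concrete, the sketch below shows one plausible way a multi-step agent trajectory could be flattened into a single text stream for next-token prediction. The record schema and the `<plan>`/`<tool>`/`<obs>` tags are illustrative assumptions, not the format used in the cited work.

```python
# A minimal sketch of trajectory serialization for Agentic CPT.
# The record schema and the <plan>/<tool>/<obs> tags are hypothetical
# illustrations, not the format used in the cited papers.

trajectory = {
    "question": "What year was the transformer architecture introduced?",
    "steps": [
        {"plan": "Search for the original transformer paper.",
         "tool_call": 'search(query="Attention Is All You Need paper year")',
         "observation": "Vaswani et al., 'Attention Is All You Need', NeurIPS 2017."},
    ],
    "answer": "2017",
}

def serialize(traj: dict) -> str:
    """Flatten a multi-step agent trajectory into one token stream."""
    parts = [f"<question>{traj['question']}</question>"]
    for step in traj["steps"]:
        parts.append(f"<plan>{step['plan']}</plan>")
        parts.append(f"<tool>{step['tool_call']}</tool>")
        parts.append(f"<obs>{step['observation']}</obs>")
    parts.append(f"<answer>{traj['answer']}</answer>")
    return "\n".join(parts)

print(serialize(trajectory))
```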

The motivating hypothesis is that a model with agentic CPT can disentangle capability acquisition (general agentic behaviors and situational competence) from alignment (matching expert demonstration style and correctness), leading to higher performance and lower SFT loss—especially in open-domain and compositional tasks (Su et al., 16 Sep 2025).

2. Data Synthesis Methods for Agentic CPT

Agentic data construction is central to the efficacy of CPT. Two principal methodologies are employed in recent research:

First-Order Action Synthesis (FAS): FAS generates question–action–reasoning tuples from open corpora, without explicit tool invocation. The resulting agentic sequences may include questions in multiple styles and structured planning chains, simulating the agentic traces typical of complex tasks such as web navigation or information retrieval.

Higher-Order Action Synthesis (HAS): HAS expands agentic data to encompass full decision processes, including action branching. For each agentic step, multiple candidate actions are generated, then permuted and associated with contrastive signals, allowing the model to learn nuanced decision trade-offs and to cover a wider action-permutation space.
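
One way to picture the difference between the two record types is sketched below. All field names, the example contents, and the permutation-expansion helper are illustrative assumptions rather than structures taken from the source papers.

```python
import itertools
import random

# Hypothetical FAS record: a question paired with one reasoned action,
# synthesized from an open corpus without live tool execution.
fas_record = {
    "question": "Which city hosted the 2012 Summer Olympics?",
    "reasoning": "This is a factual lookup; recall or retrieve the host city.",
    "action": "answer('London')",
}

# Hypothetical HAS record: each step carries several candidate actions.
# Permuting candidate order and marking the preferred one yields many
# contrastive training sequences from a single decision point.
has_step = {
    "state": "Results page lists three papers on continual pre-training.",
    "candidates": [
        "open(result=1)",      # preferred in this example
        "refine_query('agentic continual pre-training')",
        "give_up()",
    ],
    "preferred": "open(result=1)",
}

def expand_has_step(step: dict, max_perms: int = 3) -> list[str]:
    """Turn one HAS decision point into several contrastive sequences."""
    perms = list(itertools.permutations(step["candidates"]))
    random.shuffle(perms)
    sequences = []
    for perm in perms[:max_perms]:
        listed = "; ".join(perm)
        sequences.append(
            f"State: {step['state']}\n"
            f"Candidates: {listed}\n"
            f"Chosen: {step['preferred']}"
        )
    return sequences

for seq in expand_has_step(has_step):
    print(seq, end="\n---\n")
```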

During CPT, these data modalities are interleaved and padded into long contexts, with agentic windows progressively extended in later training stages (e.g., 32K → 128K tokens), enhancing the model’s ability to track deep, multi-step reasoning and context-dependent long-horizon dependencies.
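A minimal packing sketch follows, assuming trajectories are already tokenized. The greedy strategy, the PAD_ID value, and the truncation of overlong trajectories are simplifying assumptions for illustration; the stage windows mirror the 32K to 128K extension described above.

```python
# Minimal sketch of packing tokenized trajectories into fixed-length
# windows. Greedy packing, PAD_ID, and truncation are illustrative
# assumptions, not details taken from the cited work.

PAD_ID = 0

def pack(trajectories: list[list[int]], window: int) -> list[list[int]]:
    """Greedily interleave trajectories into windows of `window` tokens."""
    windows, current = [], []
    for traj in trajectories:
        if current and len(current) + len(traj) > window:
            current += [PAD_ID] * (window - len(current))  # pad the remainder
            windows.append(current)
            current = []
        current += traj[:window]  # truncate trajectories longer than a window
    if current:
        current += [PAD_ID] * (window - len(current))
        windows.append(current)
    return windows

# Progressive context extension across CPT stages.
for stage, window in [("CPT-1", 32_768), ("CPT-2", 131_072)]:
    demo = [[1] * 10_000, [2] * 50_000, [3] * 20_000]  # fake token ids
    print(stage, "->", len(pack(demo, window)), "windows of", window)
```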

The training objective remains next-token prediction over the synthesized trajectories:

$$\mathcal{L} = -\sum_{t=1}^{T} \log P(x_{t+1} \mid x_1, x_2, \ldots, x_t)$$

where the tokens $x_t$ encode both action specifications and reasoning content.
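
A minimal PyTorch rendering of this objective over a packed batch might look as follows; the shift-by-one indexing is the only substantive detail, and in practice padded positions would additionally be masked out.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Causal language-modeling loss over packed agent trajectories.

    logits: (batch, seq_len, vocab) model outputs
    tokens: (batch, seq_len) input ids; position t predicts token t+1
    """
    # Shift by one so logits at position t are scored against token t+1.
    shifted = logits[:, :-1, :].reshape(-1, logits.size(-1))
    targets = tokens[:, 1:].reshape(-1)
    # In a real pipeline, pad positions would be excluded,
    # e.g. F.cross_entropy(..., ignore_index=pad_id).
    return F.cross_entropy(shifted, targets)

# Toy usage with random tensors standing in for a model's outputs.
batch, seq_len, vocab = 2, 16, 100
logits = torch.randn(batch, seq_len, vocab)
tokens = torch.randint(0, vocab, (batch, seq_len))
print(next_token_loss(logits, tokens))
```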

3. Pipeline Structure and Implementation

A canonical Agentic CPT pipeline consists of the following phases (Su et al., 16 Sep 2025):

| Phase | Data Volume | Context Length | Agentic Features |
|-------|-------------|----------------|------------------|
| Base pre-training | trillions of tokens | 2K–4K tokens | Generic text; no agentic exposure |
| Agentic CPT-1 | ~200B tokens | ~32K tokens | Basic planning, tool/action traces |
| Agentic CPT-2 | ~100B tokens | up to 128K tokens | High-quality, long-horizon agentic data |
| Post-training | 10B–100B tokens | variable | SFT, RL alignment, reward feedback |
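
The staging above can be expressed as a simple schedule. The token budgets and context lengths come from the table; the dataclass layout itself is an illustrative assumption, not an interface from the paper.

```python
from dataclasses import dataclass

@dataclass
class CPTStage:
    name: str
    token_budget: float   # training tokens for this stage
    context_length: int   # maximum sequence length
    notes: str

# Schedule mirroring the table above; the structure is illustrative.
schedule = [
    CPTStage("agentic_cpt_1", 200e9, 32_768,
             "basic planning and tool/action traces"),
    CPTStage("agentic_cpt_2", 100e9, 131_072,
             "high-quality, long-horizon agentic data"),
]

for stage in schedule:
    print(f"{stage.name}: {stage.token_budget:.0e} tokens "
          f"@ {stage.context_length} context ({stage.notes})")
```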

During each CPT stage, data is shuffled and sampled to maximize coverage of agentic behaviors. Optimization follows standard language-modeling objectives, although architectural adjustments, such as depth, width, and long-sequence attention, improve downstream agentic capacity (Liu et al., 2023). Notably, mixture-of-experts (MoE) architectures have shown strong scaling and robustness under CPT conditions (Thérien et al., 6 Mar 2025).

4. Evaluation, Benchmarking, and Impact

Agentic CPT models demonstrate substantial empirical improvements on agentic benchmarks. For example, AgentFounder-30B, continually pre-trained via two-stage agentic CPT, attained:

| Benchmark | Pass@1 Score |
|-----------|--------------|
| BrowseComp-en | 39.9% |
| BrowseComp-zh | 43.3% |
| HLE | 31.5% |
| GAIA | 72.8% |

Competing models trained solely with SFT or RL on foundation models consistently underperformed in agentic and tool-use-specific tasks. SFT loss during downstream alignment also significantly decreases when starting from an agentically pre-trained foundation, suggesting more effective parameter adaptation and lower optimization barriers (Su et al., 16 Sep 2025).

Scaling studies further indicate consistent improvements in agentic task performance as both agentic data diversity and model parameter count increase (Liu et al., 2023).

5. Architectural and Optimization Aspects

Agentic CPT is most effective with models designed for long-context reasoning and modular capacity scaling. Key considerations include:

  • Use of decoder-only GPT architectures or MoEs for efficient context handling and expert activation (Thérien et al., 6 Mar 2025, Liu et al., 2023).
  • The design of training curricula: staged exposure to increasingly complex agentic sequences, ensuring gradual adaptation from generic to high-fidelity agentic behaviors.
  • Replay mechanisms and learning-rate annealing for stability under domain shift, mitigating catastrophic forgetting while promoting domain adaptation (Thérien et al., 6 Mar 2025, Wang et al., 12 May 2025); a minimal sketch follows this list.
  • Empirically derived scaling laws for predicting loss dynamics and domain trade-off (general vs domain-specific performance), enabling targeted hyperparameter customization during CPT (Wang et al., 12 May 2025).
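
The replay and annealing points can be made concrete with a short sketch. The 25% replay ratio and the cosine schedule are common choices in the continual pre-training literature rather than values prescribed by the cited papers.

```python
import math
import random

def mixed_batch(agentic: list, general_replay: list,
                batch_size: int, replay_ratio: float = 0.25) -> list:
    """Mix new agentic data with replayed general-domain data.

    Replaying a fraction of the original pre-training distribution is a
    standard guard against catastrophic forgetting under domain shift.
    The 25% ratio here is an illustrative assumption.
    """
    n_replay = int(batch_size * replay_ratio)
    return (random.sample(general_replay, n_replay)
            + random.sample(agentic, batch_size - n_replay))

def annealed_lr(step: int, total_steps: int,
                peak_lr: float = 1e-4, min_lr: float = 1e-5) -> float:
    """Cosine learning-rate annealing, commonly used to stabilize CPT."""
    progress = step / max(total_steps, 1)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Toy usage with placeholder samples.
agentic_pool = [f"agentic_{i}" for i in range(100)]
general_pool = [f"general_{i}" for i in range(100)]
print(mixed_batch(agentic_pool, general_pool, batch_size=8))
print(annealed_lr(step=500, total_steps=1000))
```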

6. Implications, Limitations, and Future Work

Agentic CPT fundamentally alters the training pipeline for agentic LLMs by separating agentic competence learning from demonstration alignment, alleviating major optimization bottlenecks. Its success motivates several future research directions:

  • Expanding the scope and realism of agentic data synthesis (e.g., integrating multimodal or more diverse real-world sources).
  • Systematic exploration of the trade-off between agentic context length and inference-time performance in very long-horizon tasks (Su et al., 16 Sep 2025).
  • Integrating CPT with environment scaling and simulated multi-agent ecosystems to further diversify the agentic experience base (Fang et al., 16 Sep 2025).
  • Development of tailored evaluation benchmarks focusing on emergent agentic behaviors, function-calling robustness, and tool-use generalization.
  • Investigating the interplay of continual pre-training with interactive feedback and online adaptation for lifelong agentic learning.

A plausible implication is that Agentic CPT may support the emergence of robust, adaptable research agents, autonomous problem-solvers, and information synthesis agents capable of handling increasingly open-ended, real-world environments at scale.
