Post-Training Dynamics in Foundation Models
- Post-training dynamics is the process by which models evolve following initial pretraining, reshaping probability distributions and internal states to achieve enhanced adaptability.
- It leverages off-policy and on-policy methods to expand state support, reshape policies, and consolidate beneficial behaviors via techniques such as SFT, RLHF, and distillation.
- Research indicates that where updates are applied (update locality) and coordinated structural changes, such as singular-vector rotations, underlie improvements in generalization, retention, and deployability across domains.
Post-training dynamics refers to the evolution of model behavior and internal state distributions as large language models (LLMs) or other foundation models undergo additional fine-tuning, alignment, or adaptation stages after their initial broad pretraining. This process encompasses how probability mass over tokens and trajectories shifts, how previously unreachable behaviors become possible, and how these transformations interact across sequential or hybrid pipelines. Post-training dynamics is central to understanding model alignment, generalization, retention, and practical deployability across domains including language, biology, robotics, and video generation.
1. Regimes of Post-Training: Trajectory Provenance
A primary axis in post-training methodology is trajectory provenance—whether learning proceeds off-policy (on fixed, externally supplied trajectories) or on-policy (using learner-generated rollouts). The dynamics, bottlenecks, and behavioral shifts are often better understood through this lens than by the surface loss/objective (Zhao et al., 9 Apr 2026).
Off-Policy Learning: Training updates are computed on trajectories sampled from an external process, such as human demonstrations, high-quality responses, teacher models, or preference datasets. Formally, the update objective is an expected loss over trajectories drawn from this fixed external distribution rather than from the learner's current policy.
Examples include supervised fine-tuning (SFT) and offline preference optimization. These interventions can induce entirely new reasoning paths or styles that are inaccessible to the original policy.
On-Policy Learning: Updates use trajectories actively sampled from the learner itself, enabling correction of rollout-dependent errors. On-policy RL is characterized by objectives that maximize the expected reward of the learner's own rollouts, typically combined with an entropy bonus. This regime addresses compounding error and local behavioral failures, and it unlocks exploration beyond the support of the replayed demonstrations.
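One common way to write the two regimes is sketched below. The notation is an assumption introduced here for illustration and is not taken from the cited work: $\pi_\theta$ is the learner, $\mathcal{D}$ the external trajectory source, $\ell$ a per-trajectory loss, $r$ a reward, $\mathcal{H}$ the policy entropy, and $\beta \ge 0$ an entropy coefficient.

```latex
% Off-policy: the sampling distribution D is fixed and does not depend on theta.
\mathcal{L}_{\text{off}}(\theta)
  = \mathbb{E}_{\tau \sim \mathcal{D}}\big[\ell(\theta; \tau)\big],
\qquad
\text{e.g. SFT: } \ell(\theta; \tau) = -\sum_{t} \log \pi_\theta(a_t \mid s_t)

% On-policy: trajectories are sampled from the learner itself.
J_{\text{on}}(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\big[r(\tau)\big] + \beta\, \mathcal{H}(\pi_\theta)
```

The only structural difference is the sampling distribution under the expectation, which is the sense in which provenance, rather than the surface loss, separates the two regimes.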
Hybrid and Multi-stage Pipelines: Practical post-training often composes off-policy support expansion (SFT) with on-policy policy reshaping (RLHF, process supervision), followed by behavioral consolidation (distillation) to preserve and amortize gains. Skipping any stage risks unreachable behaviors or catastrophic forgetting (Zhao et al., 9 Apr 2026).
2. Functional Roles: Support Expansion, Policy Reshaping, Behavioral Consolidation
Post-training stages are classified by their population-level effect on the induced state–action distribution:
- Effective Support Expansion: State–action pairs that previously had negligible probability become reachable with non-negligible probability, due to imported demonstrations or guided on-policy rollouts. For example, SFT on chain-of-thought traces enlarges the trajectory manifold.
- Policy Reshaping: Probability mass is redistributed within existing support regions (i.e., for states already reachable pre-intervention). Preference optimization (DPO/Bradley-Terry, RL on current prefixes) enhances solution quality and aligns output preferences with human or verifier-derived signals.
- Behavioral Consolidation: Transfers and preserves beneficial behaviors acquired via costly off/on-policy interventions into a compact, deployable model (student), often via off-policy distillation. This stage is critical for amortizing exploration or scaffold-dependent improvements (Zhao et al., 9 Apr 2026).
A stylized pipeline proceeds: base→SFT (expands support)→RL (reshapes within new support)→distillation (consolidates for deployment).
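The distinction between support expansion and policy reshaping can be illustrated on a toy tabular distribution. The sketch below is an assumption-laden illustration, not a method from the cited work: the function name `classify_shift`, the threshold `eps` that defines "negligible", and the tolerance `tol` are all hypothetical.

```python
def classify_shift(p_before, p_after, eps=1e-3, tol=1e-9):
    """Label each state-action pair by how its probability changed.

    p_before / p_after map (state, action) keys to probabilities under the
    pre- and post-intervention policies. `eps` is a hypothetical threshold
    below which a pair counts as unreachable.
    """
    labels = {}
    for key in p_before.keys() | p_after.keys():
        before = p_before.get(key, 0.0)
        after = p_after.get(key, 0.0)
        if before < eps <= after:
            # Negligible before, reachable after.
            labels[key] = "support_expansion"
        elif after < eps <= before:
            # Reachable before, negligible after.
            labels[key] = "support_contraction"
        elif before >= eps and after >= eps and abs(after - before) > tol:
            # Mass moved within the existing support.
            labels[key] = "policy_reshaping"
        else:
            labels[key] = "unchanged"
    return labels


# Toy example: action "c" becomes reachable, "a" loses mass, "b" is untouched.
before = {("s0", "a"): 0.7, ("s0", "b"): 0.3, ("s0", "c"): 0.0}
after = {("s0", "a"): 0.4, ("s0", "b"): 0.3, ("s0", "c"): 0.3}
shift_labels = classify_shift(before, after)
```

In practice the distributions are over long trajectories and cannot be enumerated, so any real measurement would rely on sampled rollouts and likelihood estimates rather than exact tables.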
3. State Distribution and Update Locality
Recent work formalizes post-training not through token-level objectives, but as intentional shaping of the state-distribution that the learner inhabits (Nie, 21 May 2026). In autoregressive models, a “state” is a prompt plus generated prefix. SFT applies dense updates off-policy (fixed data states), often risking forgetting when distributional overlap is low. On-policy RL and on-policy distillation apply updates exclusively at states the learner actually visits, confining change to reachable regions and minimizing catastrophic interference with unrelated capabilities. The locality of update—i.e., whether supervision is injected at states likely to be encountered—and the form of the signal (tokens, preference, reward, teacher logits) jointly determine retention and generalization properties.
Systematic experiments confirm that on-policy post-training preserves capabilities even under significant aggregate parameter drift, and that students can surpass degraded teachers under on-policy distillation, simply by localizing corrections to the regions of state space that the student can actually reach (Nie, 21 May 2026).
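A rough way to picture update locality is to ask what fraction of the supervision lands on states the learner actually visits. The sketch below is a toy proxy under stated assumptions, not the metric used in the cited work: states are enumerable, `p_update` is the distribution of states where updates are applied, `p_learner` is the learner's own state-visitation distribution, and `update_locality` and `eps` are hypothetical names.

```python
def update_locality(p_update, p_learner, eps=1e-6):
    """Fraction of update mass that falls on states the learner visits.

    p_update: state -> weight of supervision applied at that state.
    p_learner: state -> probability that the learner reaches that state.
    A state counts as visited when its learner probability exceeds `eps`.
    """
    total = sum(p_update.values())
    if total <= 0.0:
        raise ValueError("p_update must contain positive mass")
    on_support = sum(
        weight for state, weight in p_update.items()
        if p_learner.get(state, 0.0) > eps
    )
    return on_support / total


learner_states = {"s1": 0.5, "s2": 0.5}

# Off-policy data with low overlap: most supervision is at a state ("s3")
# that the learner never reaches.
off_policy_states = {"s1": 0.25, "s3": 0.75}
locality_off = update_locality(off_policy_states, learner_states)

# On-policy updates are applied exactly where the learner goes.
locality_on = update_locality(learner_states, learner_states)
```

On this toy example the on-policy case scores 1.0 by construction, while the off-policy case scores lower whenever the data visits states outside the learner's reach.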
4. Quantitative and Structural Theories: Scaling Laws and Plasticity-Ceiling Decomposition
Scaling Laws and Unified Objectives:
Analyses have shown that both SFT and RL objectives can be unified as maximizing expected reward with a KL anchor to demonstration policy, and that the corresponding gradients decompose into interchangeable normalization, reference, advantage, and stabilization components (Lv et al., 4 Sep 2025). Hybrid algorithms can dynamically blend SFT and RL signals according to the model’s current proficiency, yielding stable, continual improvement and efficient exploitation–exploration trade-offs.
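A hedged sketch of such a unified objective follows. The symbols are assumptions introduced here: $\pi_{\text{demo}}$ is the demonstration policy, $\lambda \ge 0$ weights the anchor, and the direction of the KL term is one plausible choice rather than a detail confirmed by the text above.

```latex
J_\lambda(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\big[r(\tau)\big]
  - \lambda\, \mathrm{KL}\big(\pi_{\text{demo}} \,\|\, \pi_\theta\big)

% With this direction, the anchor reduces to a cross-entropy term:
\mathrm{KL}\big(\pi_{\text{demo}} \,\|\, \pi_\theta\big)
  = -\,\mathbb{E}_{\tau \sim \pi_{\text{demo}}}\big[\log \pi_\theta(\tau)\big]
  + \text{const}
```

Under this reading, the anchor term is the SFT negative log-likelihood on demonstrations up to a constant, so large $\lambda$ recovers SFT-like behavior and small $\lambda$ recovers reward maximization.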
Plasticity–Ceiling Framework:
Ultimate post-training performance decomposes into two parts: the performance established by SFT, which provides the foundation, and the additional gain unlocked by subsequent RL, referred to as “plasticity.” Optimal pipeline design requires switching to RL when SFT validation loss is in its stable or mild-overfitting regime; further SFT induces overfitting and collapses RL plasticity. Data scale and trajectory difficulty both scale the achievable post-training ceiling, and sequential SFT→RL pipelines outperform concurrent or pure-RL paradigms across all tested reasoning domains (Ding et al., 12 Dec 2025).
Bias–Variance in Data Regularization:
Dr. Post-Training reframes the integration of scarce, high-fidelity (“target”) and abundant, noisy (“general”) data as regularization in gradient space. By constraining update directions based on general gradients, the method interpolates between high-variance, target-only (flexible but brittle) and low-variance, general-only (biased) post-training, achieving bias–variance trade-off control at LLM scale (Hu et al., 8 May 2026).
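To make "regularization in gradient space" concrete, the sketch below uses a projection-style constraint similar to gradient surgery. This is a swapped-in illustration and not the Dr. Post-Training algorithm itself: the function `regularized_direction` and its `strength` parameter are hypothetical, and gradients are plain Python lists. It also covers only part of the trade-off described above, moving from a target-only update toward one that no longer conflicts with the general gradient, without reaching a general-only update.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def regularized_direction(g_target, g_general, strength=1.0):
    """Constrain the target gradient using the general gradient.

    If the target gradient conflicts with the general gradient (negative
    inner product), remove a `strength` fraction of the conflicting
    component. strength=0 keeps the target-only update; strength=1
    removes the conflict entirely.
    """
    general_norm_sq = dot(g_general, g_general)
    inner = dot(g_target, g_general)
    if general_norm_sq == 0.0 or inner >= 0.0:
        # No usable general signal, or no conflict: leave the update as is.
        return list(g_target)
    coef = strength * inner / general_norm_sq
    return [t - coef * g for t, g in zip(g_target, g_general)]


g_target = [1.0, -1.0]   # gradient from scarce, high-fidelity data
g_general = [0.0, 1.0]   # gradient from abundant, noisy data

projected = regularized_direction(g_target, g_general, strength=1.0)
unconstrained = regularized_direction(g_target, g_general, strength=0.0)
```

Intermediate values of `strength` trade flexibility on the target data against agreement with the general data, which is the bias–variance dial in miniature.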
5. Post-Training Dynamics in Specific Domains
Language: Value Drift, Forgetting, and Cross-lingual Transfer
- Value Drifts: SFT rapidly imprints core model values; subsequent preference optimization (PPO, DPO, SimPO) moves value alignment only when preference data exhibits strong stance contrast (“value-gap”). Algorithm choice (PPO preserves, DPO amplifies, SimPO moderates) is only impactful in the presence of high-quality preference signals (Bhatia et al., 30 Oct 2025).
- Forgetting: Sample-wise frameworks quantify 1→0 (forgetting) and 0→1 (backward transfer) transitions per item, demonstrating that RL/SFT on base or instruction-tuned models induces moderate gains in math/logic and generally low-to-moderate forgetting. Model merging does not reliably recover lost behavior. Monitoring per-stage F_true and BT_true enables targeted mitigation and understanding of catastrophic interference (Harmon et al., 20 Oct 2025).
- Cross-lingual Transfer: Multilingual instruction fine-tuning is characterized by task- and scale-dependent phase transitions: simple tasks saturate with few non-English examples per language, but complex reasoning requires orders of magnitude more data for parity, especially in small models. Model scale compresses the multilingual data requirement, and multi-task regimes require careful validation to avoid interference (Shimabucoro et al., 23 Apr 2025).
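The sample-wise view of forgetting mentioned above can be sketched by counting per-item correctness transitions between two checkpoints. This is a minimal illustration of raw 1→0 and 0→1 rates only; it does not reproduce the definitions of F_true or BT_true from the cited work, and the function name `transition_rates` is hypothetical.

```python
def transition_rates(before, after):
    """Count per-item correctness transitions between two checkpoints.

    before / after: equal-length sequences of 0/1 (or bool) correctness
    flags for the same evaluation items, in the same order.
    Returns raw rates as fractions of all items.
    """
    if len(before) != len(after):
        raise ValueError("before and after must cover the same items")
    if not before:
        raise ValueError("need at least one item")
    n = len(before)
    forgotten = sum(1 for b, a in zip(before, after) if b and not a)  # 1 -> 0
    gained = sum(1 for b, a in zip(before, after) if not b and a)     # 0 -> 1
    return {"forgetting": forgotten / n, "backward_transfer": gained / n}


# Five items evaluated before and after a post-training stage.
pre_stage = [1, 1, 0, 0, 1]
post_stage = [1, 0, 1, 0, 1]
rates = transition_rates(pre_stage, post_stage)
```

Aggregate accuracy is unchanged in this example (3 of 5 correct both times), yet one item was forgotten and one was gained, which is exactly the information that an aggregate score hides.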
Robotics and Vision-Language-Action
- Steerability and Lock-In: Low-data SFT can induce concept or spatial lock-in, where policies overfit to narrowly demonstrated concepts or targets. Regulating weight drift in pre-trained visual encoders and using contrastive prompt-steering at test time restores generalization and navigability without requiring external models or data augmentation (Huang et al., 25 Apr 2026).
- World Model Distillation: Aligning VLA policies with the latent manifold of skill-compositional world models via contrastive objectives yields robust, temporally consistent actions, bypassing the instability of pixel-based supervision. Skill decomposition by LLM-based segmentation enables tractable, compositional world model post-training (Vuong et al., 11 Mar 2026).
Scientific Reasoning and Multimodal Models
In biology, post-training reshapes generalization through distinct non-monotonic dynamics. SFT increases in-domain accuracy at the expense of OOD performance, which peaks early and then declines. RL, when initialized from strong SFT checkpoints and properly regularized, can recover and surpass OOD generalization. Optimal resource allocation requires brief SFT and larger RL allocation; continued domain pre-training enhances both downstream phases (Fesser et al., 15 Jun 2026).
Video Generation under Physical Constraints
Enforcing Newtonian structure via verifiable rewards—constant-acceleration kinematic residuals and mass conservation proxies—substantially improves the temporal coherence and physical plausibility of learned video dynamics, even under OOD perturbations. The approach leverages frozen utility models for optical flow and mass embedding extraction, injecting explicit physical inductive bias post hoc (Le et al., 29 Nov 2025).
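A constant-acceleration kinematic residual can be sketched with finite differences. Under constant acceleration and uniform frame spacing, an object's position is quadratic in time, so its third finite difference vanishes. The code below is an illustrative assumption rather than the reward from the cited work: `positions` stands for a hypothetical 1D object track (for example, one derived from optical flow), and `constant_acceleration_residual` is a made-up name.

```python
def constant_acceleration_residual(positions):
    """Mean absolute third finite difference of a uniformly sampled track.

    For motion with constant acceleration, position is quadratic in time
    and every third difference is zero, so larger values indicate larger
    deviation from that motion model.
    """
    if len(positions) < 4:
        raise ValueError("need at least four samples")
    third_diffs = [
        positions[t + 3] - 3 * positions[t + 2] + 3 * positions[t + 1] - positions[t]
        for t in range(len(positions) - 3)
    ]
    return sum(abs(d) for d in third_diffs) / len(third_diffs)


# Track following x(t) = t^2 + 3t + 1, i.e. constant acceleration.
smooth_track = [float(t * t + 3 * t + 1) for t in range(6)]
# Track with an implausible one-frame jump.
jerky_track = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]

smooth_residual = constant_acceleration_residual(smooth_track)
jerky_residual = constant_acceleration_residual(jerky_track)
```

A verifiable reward could then be defined as the negative residual, so that physically smoother trajectories score higher. How such a residual is combined with mass-conservation proxies is not specified here.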
6. Parameter-Space and Representational Structural Changes
SVD-based analyses reveal that post-training, especially instruction-tuning and long-chain-of-thought distillation, consistently induces (1) near-uniform geometric rescaling of singular values—effectively modulating attention temperature—combined with (2) highly coordinated orthogonal rotations of left and right singular vectors within linear layers. The singular value scaling acts as a secondary, entropy-smoothing effect, while the functional adaptation is achieved primarily through these subspace rotations. Disrupting this orthogonal alignment leads to catastrophic degradation. Thus, post-training can be conceived as learning per-layer orthogonal transformations within pretrained subspaces, moving away from the “black box” drift paradigm (He et al., 22 Sep 2025).
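The structural picture above can be written compactly as follows. The notation is an assumption made for illustration: $W_{\text{pre}} = U \Sigma V^\top$ is the SVD of a pretrained linear layer, $\alpha > 0$ is a near-uniform scale factor, and $Q_U$, $Q_V$ are orthogonal matrices. Writing the coordination as $Q_U \approx Q_V$ is one reading of "highly coordinated", not a confirmed detail.

```latex
W_{\text{pre}} = U \Sigma V^\top,
\qquad
W_{\text{post}} \approx (U Q_U)\,(\alpha \Sigma)\,(V Q_V)^\top,
\qquad
Q_U^\top Q_U = Q_V^\top Q_V = I,
\quad Q_U \approx Q_V
```

In this form the scalar $\alpha$ carries the singular-value rescaling, while $Q_U$ and $Q_V$ carry the functional adaptation. Replacing one of them with an unrelated orthogonal matrix would break the alignment that the analysis identifies as critical.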
7. Open Challenges and Future Directions
Rigorous theory for post-training dynamics remains incomplete. Open problems include:
- Predicting transfer, forgetting, or value-drift given training sequence and data properties.
- Quantifying support expansion versus elicitation of latent (pretrained but unreachable) behaviors, and developing reliable retention and consolidation metrics.
- Designing automated, task-adaptive schedules for RL post-training, particularly in non-stationary or curriculum-rich environments. Recent LLM-driven agent systems highlight the need to treat capacity-parameter schedules separately from regularization-parameter schedules, and they support dynamic oscillation in the latter (Fang et al., 16 Jun 2026).
- Mechanistic interpretability of circuit and subspace formation across post-training phases.
- Extending state-centric and bias–variance frameworks to multimodal, compositional, and long-horizon domains.
Progress depends on coordinated system design—balancing provenance of supervision, support expansion, on-policy refinement, and consolidation—rather than isolated objective engineering (Zhao et al., 9 Apr 2026). An integrated science of post-training dynamics will underpin reliable, safe, and performant adaptation of general models to diverse deployment requirements.