
Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training

Published 15 May 2026 in cs.DC and cs.LG | (2605.16184v1)

Abstract: Second-order methods offer an attractive path toward more sample-efficient LLM training, but their practical use is often blocked by the systems cost of maintaining and updating large matrix-based optimizer states. We introduce Asteria, a runtime system designed to remove this bottleneck by separating second-order optimization logic from the critical GPU training path. Rather than keeping all preconditioner state on the accelerator, Asteria dynamically distributes optimizer state across GPU memory, CPU memory, and optional NVMe storage according to architectural constraints and runtime pressure. It further uses training hooks to prepare shadow states in advance, allowing expensive inverse-root computations to proceed asynchronously on the host while GPU computation continues. For distributed training, Asteria employs a bounded-staleness protocol that limits synchronization frequency while preserving optimizer effectiveness through topology-aware coordination. We evaluate Asteria on both memory-constrained and distributed training settings. On a DGX Spark platform with a single GB10 GPU and 128GB unified memory, Asteria supports second-order training for a 1B-parameter LLM. On multi-node GH200 systems, it lowers visible optimizer overhead, reduces recurring latency spikes, accelerates convergence in wall-clock time, and maintains the optimization advantages of SOAP and KL-Shampoo in a 7B-parameter LLM. Our results suggest that second-order LLM training can be made practical not by simplifying the optimizer alone, but by rethinking how optimizer state, background computation, and distributed synchronization are managed at the runtime level.

Summary

  • The paper introduces Asteria, which decouples optimizer state management from GPU training to enable efficient second-order LLM training.
  • It deploys architecture-adaptive memory tiering and asynchronous shadow-stream scheduling to hide latency and reduce energy consumption.
  • Asteria achieves near-constant GPU throughput and superior loss reduction efficiency compared to first-order baselines like AdamW.

Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training: An Expert Analysis

Introduction

Second-order optimization promises improved sample efficiency and asymptotic performance for LLM training by leveraging curvature-aware updates, as found in methods like Shampoo, SOAP, and KL-Shampoo. However, the systems cost of handling optimizer state, $\mathcal{O}(N^3)$ compute and $\mathcal{O}(N^2)$ memory in the factor dimension $N$, has often made such methods impractical, particularly in memory-constrained or bandwidth-constrained environments. The paper "Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training" (2605.16184) introduces Asteria, an architecture-aware runtime system that incorporates second-order optimization into large-scale LLM training without sacrificing system efficiency, throughput, or scalability.
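To make the cost asymmetry concrete, the following is a minimal back-of-the-envelope sketch, not taken from the paper. It assumes a Shampoo-style optimizer that keeps one left factor of size m x m and one right factor of size n x n for each m x n weight matrix. The function name `shampoo_state_cost`, the 1024 x 4096 layer shape, and the 4-byte element size are illustrative assumptions.

```python
def shampoo_state_cost(m: int, n: int, bytes_per_elem: int = 4) -> dict:
    """Rough state cost for one m x n weight matrix (hypothetical helper).

    Assumes Kronecker-factored statistics: a left factor (m x m) and a
    right factor (n x n). Constants are dropped from the compute estimate.
    """
    param_elems = m * n
    factor_elems = m * m + n * n      # quadratic in each dimension
    root_cost_order = m ** 3 + n ** 3  # cubic cost of inverse roots / eigendecompositions
    return {
        "param_bytes": param_elems * bytes_per_elem,
        "factor_bytes": factor_elems * bytes_per_elem,
        "factor_to_param_ratio": factor_elems / param_elems,
        "root_cost_order": root_cost_order,
    }


# Illustrative layer shape (an assumption, not a figure from the paper).
cost = shampoo_state_cost(1024, 4096)
print(cost["factor_to_param_ratio"])  # 4.25: factors take 4.25x the weight's own storage
```

For this shape the ratio equals m/n + n/m, so strongly rectangular layers pay the most in factor memory, which is one reason where that state lives matters.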

Core Innovations in Asteria Runtime

Asteria isolates optimizer state management from the critical GPU training path and orchestrates the handling of second-order statistics across heterogeneous memory tiers (GPU, CPU, and NVMe), adapting to hardware constraints and runtime pressure. The approach rests on three key mechanisms:

  • Architecture-Adaptive Asymmetric Memory Tiering: States are dynamically distributed according to their compute lifecycles—Kronecker factors are updated on the GPU, inverse factors are managed in CPU-visible memory, and expensive inverse-root updates are executed asynchronously on the host.
  • Hook-Orchestrated Shadow-State Pipeline: Expensive matrix operations are moved off-path, with state updates scheduled via lightweight PyTorch hooks so prefetching and staging can proceed asynchronously, exploiting slack time in training execution.
  • Bounded-Staleness Selective Coherence: Staleness-aware synchronization leverages process-group hierarchies, reducing global communication by refreshing only preconditioners whose "freshness" falls outside set bounds. This further minimizes bandwidth usage and eliminates redundant host-device transfers.

Figure 1: System architecture overview, illustrating the dual-path design that decouples training computation from optimizer state maintenance.

Heterogeneous Memory Tiering and Lifecycle-Aware State Placement

Second-order optimizers exhibit a sharply asymmetric lifecycle of matrix state: factor matrices are updated frequently on accelerators, whereas inverse-root computations are infrequent, expensive, and do not need to sit on the critical path. Asteria's memory tiering strategy makes second-order LLM training possible even when GPU capacity is limited, by leveraging UVM-backed host memory and (optionally) NVMe storage to persist intermediate states and offload cold data. When necessary, explicit reclamation (madvise) shrinks the working set.

Figure 2: Memory tiering path for second-order state in Asteria, showing selective placement and flow of preconditioner matrices across devices and storage tiers.
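A lifecycle-aware placement rule of this kind could look roughly like the sketch below. It is an illustration under stated assumptions, not Asteria's actual policy: the `Tier`, `StateBlock`, and `place` names, the `cold_after` threshold, and the byte budgets are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    GPU = "gpu"
    HOST = "host"
    NVME = "nvme"


@dataclass
class StateBlock:
    name: str
    nbytes: int
    kind: str             # "factor" (updated often) or "inverse_root" (refreshed rarely)
    steps_since_use: int  # how long since the block was last read or written


def place(block: StateBlock, gpu_free: int, host_free: int, cold_after: int = 100) -> Tier:
    """Pick a tier for one optimizer-state block (hypothetical policy)."""
    # Frequently updated factors stay on the accelerator when they fit.
    if block.kind == "factor" and block.nbytes <= gpu_free:
        return Tier.GPU
    # Anything still warm goes to CPU-visible memory if there is room.
    if block.steps_since_use < cold_after and block.nbytes <= host_free:
        return Tier.HOST
    # Cold or oversized state spills to the optional storage tier.
    return Tier.NVME


print(place(StateBlock("layer0.L", 100, "factor", 0), gpu_free=200, host_free=1000))
print(place(StateBlock("layer0.L_inv", 100, "inverse_root", 500), gpu_free=200, host_free=1000))
```

A real runtime would also have to account for migration cost and for the in-flight state of asynchronous updates; the sketch only captures the hot/cold split described above.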

Shadow-Stream Scheduling and Asynchronous Preconditioner Computation

Asteria’s use of shadow streams and CPU-based asynchronous computation is critical for flattening the notorious $\mathcal{O}(N^3)$ latency spikes observed in naïve second-order training. By staging updates and enabling just-in-time prefetch via hooks, Asteria ensures that heavy numeric compute (e.g., eigendecompositions or matrix roots) is overlapped with GPU execution, restricting synchronization to the boundaries the algorithm requires.

Figure 3: Shadow-stream staging and host-side asynchronous inverse-root updates in Asteria, decoupling critical path training from second-order state maintenance.
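The shadow-state idea can be illustrated with a small standard-library sketch. This is not Asteria's implementation: a worker thread stands in for the host-side worker, a diagonal preconditioner stands in for the full matrix inverse root, and the `ShadowPreconditioner` class and its method names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def inverse_root(diag, p=4, eps=1e-12):
    """Stand-in for the expensive inverse p-th root (diagonal case only)."""
    return [(d + eps) ** (-1.0 / p) for d in diag]


class ShadowPreconditioner:
    def __init__(self, dim):
        self.stats = [1.0] * dim    # running second-moment statistics
        self.active = [1.0] * dim   # preconditioner used on the critical path
        self._pending = None        # future for the shadow copy being built
        self._pool = ThreadPoolExecutor(max_workers=1)

    def accumulate(self, grad, beta=0.9):
        # Cheap statistics update that stays on the training path.
        self.stats = [beta * s + (1 - beta) * g * g for s, g in zip(self.stats, grad)]

    def launch_refresh(self):
        # Start building a shadow state in the background, at most one at a time.
        if self._pending is None:
            snapshot = list(self.stats)  # copy so the worker sees a stable view
            self._pending = self._pool.submit(inverse_root, snapshot)

    def maybe_swap(self):
        # Non-blocking: adopt the shadow state only if it is already finished.
        if self._pending is not None and self._pending.done():
            self.active = self._pending.result()
            self._pending = None
            return True
        return False

    def precondition(self, grad):
        return [a * g for a, g in zip(self.active, grad)]

    def close(self):
        self._pool.shutdown(wait=True)


pre = ShadowPreconditioner(dim=3)
for step in range(20):
    grad = [0.1 * (step + 1), -0.2, 0.3]
    pre.accumulate(grad)
    if step % 5 == 0:          # refresh interval chosen arbitrarily for the sketch
        pre.launch_refresh()
    pre.maybe_swap()           # never waits on the background work
    update = pre.precondition(grad)
pre.close()
```

The training loop never blocks on the refresh: it keeps using the previous preconditioner until the new one is ready, which is the property that removes the periodic step-time spike at the price of a slightly stale preconditioner.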

Evaluation: Throughput, Latency, and Energy Efficiency

Latency Hiding in Memory-Constrained Environments

Asteria eliminates the recurring latency spikes of native SOAP and KL-Shampoo by backgrounding the expensive second-order steps (preconditioner updates), achieving near-constant GPU throughput and negligible exposed step-wise overhead relative to AdamW.

Figure 4: Step time distribution across training steps for OLMo-2-1B on the DGX Spark, demonstrating the removal of periodic second-order latency spikes by Asteria.

Figure 5: Step-time breakdown at the preconditioning boundary, with Asteria almost entirely hiding preconditioning cost from the critical path.

Energy and Power Dynamics

Naïve second-order implementations incur substantially higher energy per unit loss reduction; Asteria reduces both total energy and SoC energy, shifting execution to a higher-power, shorter-duration regime that enables improved hardware utilization.

Figure 6: Energy and power comparison for OLMo-2-1B training, with Asteria reducing total energy relative to native second-order baselines and improving the SoC energy tradeoff.

Asteria-KL-Shampoo produces the best normalized loss-reduction efficiency among all tested variants, surpassing even the AdamW baseline in terms of loss reduction per joule expended.

Figure 7: Energy-loss tradeoff for OLMo-2-1B, where Asteria yields higher efficiency than both first-order and native second-order optimizers.

Distributed Scaling, Staleness Tolerance, and Convergence Behavior

Asynchrony and Bounded Staleness

Asteria's capability to tolerate bounded staleness without degrading optimization performance is empirically validated for both loss and wall-clock time. In distributed training, a moderate staleness window ($S=3$ to $S=5$) suffices to hide nearly all overhead, with no adverse impact on convergence.

Figure 8: Training loss for 660M pretraining runs, demonstrating Asteria's preservation of the second-order convergence advantage in both optimizer steps and wall time.

Figure 9: Effect of staleness budget on training time and final evaluation loss, confirming that increased staleness reduces overhead with negligible impact on accuracy.
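A bounded-staleness refresh rule can be sketched in a few lines. The sketch assumes the staleness budget $S$ is counted in optimizer steps since a preconditioner was last synchronized, which this summary does not spell out; `select_stale`, `refresh`, and the layer names are hypothetical.

```python
def select_stale(last_refresh: dict, step: int, budget: int) -> list:
    """Names of preconditioners whose age exceeds the staleness budget."""
    return [name for name, t in last_refresh.items() if step - t > budget]


def refresh(last_refresh: dict, names: list, step: int) -> None:
    """Mark the selected preconditioners as synchronized at this step."""
    for name in names:
        last_refresh[name] = step


# Step at which each preconditioner was last synchronized (illustrative values).
last_refresh = {"layer0": 10, "layer1": 7, "layer2": 12}

stale = select_stale(last_refresh, step=12, budget=3)
print(stale)  # ['layer1']: age 5 exceeds the budget, the others stay within it
refresh(last_refresh, stale, step=12)
```

Only the entries returned by `select_stale` would need to be communicated in a given round, so a larger budget means fewer synchronizations per step, consistent with the overhead trend described above.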

Scalability on Large Clusters

Asteria's efficacy persists at scale, with strong-scaling efficiency maintained as node count and model size increase. For 1B and 7B models, Asteria-based second-order optimizers outpace AdamW in time-to-target-loss.

Figure 10: Training loss over wall time for 1B and 7B pretraining runs, revealing Asteria's maintenance of second-order advantages at scale.

Strong-scaling experiments demonstrate higher realized parallel efficiency and lower per-step execution time with Asteria across a 2–16 node range.

Figure 11: Strong-scaling for 7B training, with Asteria achieving superior scaling efficiency compared to native second-order baselines.
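For readers unfamiliar with the metric, strong-scaling efficiency is commonly computed as the measured speedup divided by the ideal speedup at a fixed problem size. The sketch below uses that standard definition; the paper may normalize differently, and the step times shown are made-up placeholders rather than measurements from the paper.

```python
def strong_scaling_efficiency(base_nodes: int, base_step_time: float,
                              nodes: int, step_time: float) -> float:
    """Measured speedup over the baseline divided by the ideal speedup."""
    speedup = base_step_time / step_time
    ideal = nodes / base_nodes
    return speedup / ideal


# Placeholder timings (hypothetical): 8.0 s per step on 2 nodes, 1.25 s on 16 nodes.
eff = strong_scaling_efficiency(base_nodes=2, base_step_time=8.0, nodes=16, step_time=1.25)
print(round(eff, 2))  # 0.8: a 6.4x speedup against an ideal 8x
```

An efficiency of 1.0 would mean step time falls in exact proportion to the node count; exposed optimizer and synchronization overhead push the value below that.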

Theoretical and Practical Implications

The research indicates that the bottleneck in large-scale second-order optimization is more systemic (state management and synchronization) than purely algorithmic. By decoupling matrix updates and taking a lifecycle-aware approach to memory, Asteria makes previously infeasible training configurations (single-GPU, tight-DRAM nodes) tractable for second-order methods. The broader implication is that practical second-order optimization at scale requires careful hardware-software co-design, runtime orchestration, and staleness-aware communication.

Practically, Asteria opens new deployment scenarios for second-order optimization on both commodity workstations and bandwidth-sensitive distributed clusters. The approach’s energy-efficiency improvements further reinforce the system’s value for institutions operating under power constraints.

Theoretically, the results suggest that for LLM training, the statistical gains of curvature-aware optimization may be reliably realized in real-world, large-scale settings with appropriate runtime engineering.

Future Directions

Further work could adapt Asteria to tensor- and pipeline-parallel regimes, exploit emerging memory technologies (e.g., HBM-NVMe interleaving), or refine asynchrony-control policies based on online monitoring of loss and hardware counters. There is also potential for integrating capability detection to automatically configure the system’s memory and staging behaviors per hardware node.

Conclusion

Asteria establishes that with jointly engineered runtime, memory, and communication strategies, sample-efficient second-order optimization becomes feasible, scalable, and efficient for LLM pretraining in hardware-constrained and distributed systems. These results substantively change the calculus of optimizer selection in practical deep learning, providing a robust path by which the statistical advantages of second-order methods can be harvested at scale and across heterogeneous compute environments.
