Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start

Published 8 Apr 2026 in cs.DC and cs.LG | (2604.06664v1)

Abstract: Modern LLM service providers increasingly rely on autoscaling and parallelism reconfiguration to respond to rapidly changing workloads, but cold-start latency remains a major bottleneck. While recent systems have reduced model weight loading to seconds, CUDA graph capture still takes tens of seconds to minutes and often dominates startup. Unfortunately, CUDA graphs cannot be naively serialized: beyond graph topology, they are tightly coupled to execution context, including device addresses embedded in kernel arguments and kernel code lazily loaded during warmup. Existing approaches either rely on brittle kernel-specific patching or heavyweight process-level checkpoint/restore that are inflexible to dynamic parallelism switching. We present Foundry, a template-based CUDA graph context materialization system that persists both graph topology and execution context during an offline processing stage, and reconstructs executable graphs online with negligible overhead. Foundry enforces deterministic memory layouts, automatically extracts and reloads kernel binaries required by captured graphs, and reduces online reconstruction costs through topology-based templating. For distributed serving, Foundry further enables a single-GPU offline capture to generate templates for multi-GPU deployments by patching only rank-dependent communication state. Across dense and MoE models up to 235B parameters, Foundry reduces cold-start latency by up to 99%, cutting the initialization time of Qwen3-235B-A22B from 10 minutes to 3.9 seconds while preserving the throughput gains of CUDA graphs.


Authors (6)

Summary

The paper introduces Foundry, which persists both CUDA graph topology and the full execution context to sharply reduce cold-start latency for LLM serving.
It employs a two-phase SAVE and LOAD pipeline that leverages deterministic memory layouts and kernel binary extraction for rapid, kernel-agnostic graph restoration.
Evaluation results show up to a 99% reduction in initialization time with preserved throughput, enabling efficient scaling in distributed, multi-GPU environments.


Motivation and Background

Dynamic autoscaling and rapid parallelism reconfiguration are essential for operational efficiency in production-scale LLM inference, especially under workloads whose request rates and sequence lengths fluctuate significantly. Recent advances have reduced model weight loading to seconds via optimized transfer protocols, but cold-start latency remains dominated by CUDA graph capture, which can take tens of seconds to minutes even after prior optimizations. CUDA graphs cannot be naively serialized because they are deeply entangled with device-specific execution context, such as embedded device pointers and kernel handles. As a result, existing workarounds are either brittle or heavyweight, which limits their applicability and generality.

Figure 1: Breakdown of vLLM worker initialization when serving Qwen3-14B on 2xH200s. The graph capture step dominates the initialization time.

CUDA graphs are critical for high-throughput LLM inference engines; disabling them results in significant performance degradation due to increased host-to-device launch overhead.

Figure 2: Time per output token (TPOT) increases notably across batch sizes when decoding with vLLM without CUDA graphs, reflecting the per-kernel launch overhead that graph replay avoids.

Medusa exemplifies topology-only capture, which demands fragile, kernel-specific patching at restore-time and fails to robustly accommodate evolving kernel layouts or new hardware and model architectures. By contrast, Foundry’s approach is to persist both the graph topology and its full execution context.

Figure 3: Medusa captures only topology, while Foundry persists the necessary execution context, making restoration kernel-agnostic.

System Overview and Key Insights

Foundry introduces a two-phase template-based pipeline for rapid, kernel-agnostic CUDA graph restoration:

SAVE Phase: Foundry executes a warmup and full graph capture once, intercepting and recording all execution context required for correct graph replay—specifically, deterministic memory layouts and in-memory kernel binaries.
LOAD Phase: At serving startup, Foundry restores the execution context and reconstructs executable graphs from templates, bypassing full graph capture and expensive initialization logic.

Figure 4: Foundry's workflow: offline SAVE captures graph metadata and context, producing a portable archive; LOAD reconstructs graphs at startup with minimal overhead.

Crucially, Foundry generalizes efficiently to distributed, multi-GPU environments: a single-GPU offline capture suffices to generate templates that can be patched online with rank-dependent communication state (e.g., for NCCL or NVSHMEM), supporting dynamic parallelism reconfiguration with no redundant warmups.
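The idea of patching only rank-dependent state can be illustrated with a minimal Python sketch. It is not Foundry's implementation: `CommSlot`, `patch_for_rank`, and all addresses and handle strings are hypothetical names and values chosen for illustration. The sketch assumes a captured parameter list in which rank-dependent entries have been replaced by named placeholders, so that each rank fills in only those entries at load time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommSlot:
    """Hypothetical placeholder for a value that differs per rank."""
    name: str


def patch_for_rank(template_params, rank_state):
    """Return a copy of the parameters with every CommSlot filled in.

    Rank-independent values are passed through unchanged.
    """
    return [rank_state[p.name] if isinstance(p, CommSlot) else p
            for p in template_params]


# One parameter list captured offline on a single GPU (all values illustrative).
TEMPLATE = [0x7000_0000_0000, CommSlot("comm_handle"), 4096, CommSlot("peer_buffer")]

# Rank-dependent state that would only be known once the job is launched.
RANK_STATE = {
    0: {"comm_handle": "comm-rank0", "peer_buffer": 0x7100_0000_0000},
    1: {"comm_handle": "comm-rank1", "peer_buffer": 0x7200_0000_0000},
}

patched = {rank: patch_for_rank(TEMPLATE, state) for rank, state in RANK_STATE.items()}
```

In this sketch the two ranks end up with identical rank-independent entries and differ only in the slots, which is the property that lets one offline capture serve several deployment shapes.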

Foundry intercepts the relevant allocation and kernel-loading calls by interposing on the CUDA driver API as a hook.

Figure 5: Foundry is injected via a CUDA driver hook, redirecting allocations and kernel loading for deterministic replay.

Design: Context Materialization and Templating

Execution Context Materialization

Deterministic Memory Layout: By intercepting and redirecting memory allocations using CUDA’s virtual memory management APIs, Foundry ensures that device pointers embedded in captured graph nodes remain valid and consistent across SAVE and LOAD. The allocation sequence is recorded and deterministically replayed to guarantee pointer stability—even when transient intermediates differ between phases.
Kernel Binary Extraction and Reload: By intercepting module load APIs, Foundry captures both the payload and function mapping (by content hash and mangled name) for all relevant kernel binaries, enabling precise function handle restoration without re-triggering warmup logic.
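The two mechanisms above can be modeled in a few lines of Python. This is a simplified sketch under stated assumptions, not Foundry's code: `DeterministicArena`, `replay`, `KernelRegistry`, the 2 MiB granularity, the base address, and the string "handles" are all hypothetical. It shows only the core property: replaying the same allocation sequence inside a fixed address range yields the same addresses, and a (content hash, mangled name) key is enough to look a kernel up again. It does not model transient allocations that differ between phases.

```python
import hashlib
from dataclasses import dataclass, field

ALIGN = 2 * 1024 * 1024  # assumed allocation granularity (2 MiB), illustrative only


@dataclass
class DeterministicArena:
    """Bump allocator over a fixed, pre-reserved virtual address range."""
    base: int                                  # hypothetical reserved base address
    offset: int = 0
    log: list = field(default_factory=list)    # recorded request sizes, in order

    def alloc(self, size: int) -> int:
        padded = -(-size // ALIGN) * ALIGN     # round up to the granularity
        addr = self.base + self.offset
        self.offset += padded
        self.log.append(size)
        return addr


def replay(base: int, log: list) -> list:
    """LOAD side: re-issue the recorded sizes and return the resulting addresses."""
    arena = DeterministicArena(base)
    return [arena.alloc(size) for size in log]


class KernelRegistry:
    """Maps (content hash, mangled name) to a stand-in function handle."""

    def __init__(self):
        self.binaries = {}    # content hash -> binary payload
        self.functions = {}   # (content hash, mangled name) -> stand-in handle

    def record_module(self, payload: bytes, names: list) -> str:
        digest = hashlib.sha256(payload).hexdigest()
        self.binaries[digest] = payload
        for name in names:
            self.functions[(digest, name)] = f"handle:{digest[:8]}:{name}"
        return digest

    def resolve(self, digest: str, name: str) -> str:
        return self.functions[(digest, name)]
```

A real implementation would back the arena with reserved device virtual address space and return driver function handles rather than strings; the sketch only demonstrates why recording the sequence and the hash-plus-name mapping is sufficient to reproduce pointers and kernel lookups.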

Efficient Graph Reconstruction via Templating

Graph captures for different batch sizes and ranks typically share a small set of unique topologies, varying only in per-node parameters (e.g., arguments, launch dimensions).

Figure 6: Graph templates encode shared topologies, while parameter sets provide per-instance specialization.

Rather than instantiating one graph per instance, Foundry builds one template per unique topology; parameter updates are issued on demand, leveraging the CUDA driver’s efficient in-place node parameter update API. Topology grouping is encoded during SAVE and exploited during LOAD for concurrent, lock-free preparation of parameter sets, significantly reducing wall-clock graph build time and resource contention.
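A minimal Python sketch of topology-based grouping follows. It is an illustration under assumptions rather than the paper's data model: `Node`, `topology_key`, `build_templates`, `specialize`, and the toy graphs are hypothetical. Topology here means the kernel identity and dependency edges of each node, while launch arguments are treated as per-instance parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    kernel: str     # kernel name (part of the topology)
    deps: tuple     # indices of predecessor nodes (part of the topology)
    params: tuple   # per-instance arguments / launch dims (not part of the topology)


def topology_key(graph):
    """Signature of a graph that ignores per-node parameters."""
    return tuple((n.kernel, n.deps) for n in graph)


def build_templates(graphs):
    """Group graphs by topology.

    Returns (templates, param_sets): one template per unique topology, and for
    every graph id the template key plus the parameters that specialize it.
    """
    templates, param_sets = {}, {}
    for graph_id, graph in graphs.items():
        key = topology_key(graph)
        templates.setdefault(key, [(n.kernel, n.deps) for n in graph])
        param_sets[graph_id] = (key, [n.params for n in graph])
    return templates, param_sets


def specialize(templates, param_sets, graph_id):
    """Rebuild one graph instance by applying its parameters to its template."""
    key, params = param_sets[graph_id]
    return [Node(k, d, p) for (k, d), p in zip(templates[key], params)]


def make_graph(batch_size):
    """Toy capture: larger batches add one extra node, so two topologies exist."""
    nodes = [Node("attention", (), (batch_size,)), Node("mlp", (0,), (batch_size,))]
    if batch_size > 4:
        nodes.append(Node("reduce", (1,), (batch_size,)))
    return nodes


graphs = {bs: make_graph(bs) for bs in range(1, 9)}
templates, param_sets = build_templates(graphs)
```

In this toy setup eight captured graphs collapse into two templates. Because each parameter set is independent of the others, the sets could be prepared concurrently, which is the property the LOAD phase is described as exploiting.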

Evaluation

Cold-Start Latency and Scalability

Foundry achieves up to 99% cold-start latency reduction compared to native vLLM with CUDA graphs, shrinking initialization from minutes to a few seconds across both dense and MoE models up to 235B parameters.
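As a quick sanity check on the headline figure, the abstract's Qwen3-235B-A22B numbers (10 minutes down to 3.9 seconds) can be plugged into a relative-reduction formula. The helper name `reduction` is illustrative, and the 10-minute baseline is assumed to be a rounded value.

```python
def reduction(before_s: float, after_s: float) -> float:
    """Fraction of the original time that was eliminated."""
    return (before_s - after_s) / before_s


before_s = 10 * 60   # 10 minutes, as quoted in the abstract
after_s = 3.9        # seconds, as quoted in the abstract

r = reduction(before_s, after_s)
speedup = before_s / after_s
print(f"reduction: {r:.2%}, speedup: {speedup:.0f}x")
```

With these inputs the reduction is about 99.35%, or roughly a 154x speedup, which is consistent with the "up to 99%" claim.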

Figure 7: Foundry reduces engine initialization time by up to 99% relative to vLLM’s default CUDA graph capture across multiple model-parallelism configurations.

A per-phase breakdown shows where this improvement comes from in both dense and expert-parallel deployments.

Figure 8: Phase analysis of end-to-end initialization: Foundry consistently outperforms both vLLM and CUDA-checkpoint, eliminating graph capture as a bottleneck.

Serving Throughput Preservation

Serving performance shows no significant degradation: TPOT with Foundry closely tracks the baseline obtained with native graph capture.

Figure 9: Mean TPOT as a function of batch size for vLLM and Foundry: the matching curves indicate that restored graphs perform the same as natively captured ones.

This is consistent with Foundry’s execution context materialization and template-based restoration producing graphs equivalent to natively captured ones.

Templating Efficiency

Templating compresses the large set of captured graphs (e.g., 512 for batch sizes 1–512) to a handful of templates (12–25 per model, comprising 2–5% of graphs), allowing over 95% of graphs to be served via rapid in-place parameter update rather than full reconstruction.
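The percentages quoted above follow directly from the template counts. The short check below assumes the 512-graph example from the text; the function name `template_fraction` is illustrative.

```python
def template_fraction(num_templates: int, num_graphs: int) -> float:
    """Share of captured graphs that must be instantiated as full templates."""
    return num_templates / num_graphs


NUM_GRAPHS = 512  # batch sizes 1-512, as in the example above

for templates in (12, 25):
    frac = template_fraction(templates, NUM_GRAPHS)
    print(f"{templates} templates: {frac:.1%} instantiated, "
          f"{1 - frac:.1%} served by parameter update")
```

Twelve templates correspond to about 2.3% of 512 graphs and twenty-five to about 4.9%, so between roughly 95.1% and 97.7% of graphs fall into the parameter-update path, in line with the stated 2–5% and "over 95%" figures.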

Figure 10: Per-graph construction cost: parameter update is 24–32x faster than template instantiation, and both are far faster than native stream capture.

Figure 11: Fraction of graphs (per model) that can be served by on-demand parameter update, highlighting the dramatic compression to a small set of templates.

Theoretical and Practical Implications

The separation of topology and execution context enables kernel- and library-agnostic graph restoration, removing the need for kernel-specific patching and enhancing portability to evolving model architectures and hardware platforms. By generalizing to SPMD-style parallel inference, Foundry reduces both the hardware cost and storage requirements of graph materialization, facilitating flexible, fine-grained autoscaling and dynamic parallelism policy optimization.

Practically, Foundry enables large-scale LLM deployments to minimize the "time to first token" during autoscale-up or parallelism reconfiguration events, directly improving resource elasticity and user experience.

Theoretically, this approach suggests a path toward general-purpose, hardware-agnostic snapshotting of heterogeneous accelerator execution graphs—a direction with potential applicability to other domains beyond LLM inference.

Conclusion

Foundry provides a robust, kernel-agnostic solution to the cold-start bottleneck in scalable LLM serving by coupling template-based CUDA graph context materialization with deterministic execution context restoration. Empirical results demonstrate orders-of-magnitude latency reduction while maintaining peak throughput and minimal storage overhead. Foundry establishes a general, portable, and efficient framework for rapid LLM service startup and dynamic reconfiguration (2604.06664).
