VkSplat: Vulkan-Based 3D Gaussian Splatting Training

Updated 4 July 2026

VkSplat is a fully Vulkan-based 3D Gaussian Splatting training system that implements the entire training loop in Vulkan compute instead of the traditional CUDA + PyTorch stack.
It achieves a 3.3x speedup and reduces VRAM usage by 33% through explicit memory management and pre-allocation strategies, addressing inefficiencies and vendor lock-in.
Experiments on NVIDIA and AMD GPUs demonstrate quality preservation with competitive PSNR, SSIM, and LPIPS metrics while ensuring broad cross-vendor compatibility.

VkSplat is a fully Vulkan-based 3D Gaussian Splatting (3DGS) training pipeline whose central contribution is to show that the entire 3DGS training loop can be implemented on the Vulkan compute stack rather than on the usual CUDA + PyTorch combination. It is presented as a response to the performance and compatibility limitations of CUDA-centric training pipelines, with the stated outcomes of a $3.3\times$ speedup, a $33\%$ VRAM reduction, quality preservation, and compatibility across GPU vendors (Chen et al., 30 Apr 2026).

1. Definition, scope, and motivation

VkSplat concerns training, not merely rendering. Its defining claim is that the full 3DGS workflow can be expressed in Vulkan compute, including forward rendering, backward propagation, optimization, and densification. The paper positions this as the first fully Vulkan-based 3DGS training pipeline and emphasizes that the novelty lies in replacing a CUDA + PyTorch training stack with a pure Vulkan implementation (Chen et al., 30 Apr 2026).

The problem setting is the standard 3DGS training regime in which PyTorch typically manages tensors, autograd, optimizer state, and much of the training orchestration, while custom CUDA kernels handle expensive rendering and rasterization. Two practical drawbacks are identified. First, the baseline is subject to vendor lock-in and limited portability, because CUDA is NVIDIA-specific and performance-critical 3DGS implementations are often tied to CUDA kernels. Second, it incurs memory overhead: a PyTorch-based pipeline adds allocator and tensor-management overhead and often keeps separate CUDA and framework-managed buffers. The paper further notes that densification can trigger dynamic buffer resizing, which increases VRAM usage and can create spikes that lead to out-of-memory (OOM) failures.

The significance of a pure Vulkan implementation is therefore framed in system terms rather than algorithmic novelty alone. Vulkan provides a more controlled, explicit memory and execution model, avoids the CUDA-only assumption of many prior implementations, and broadens 3DGS training beyond NVIDIA-only ecosystems. A plausible implication is that VkSplat treats portability and memory behavior as first-class design objectives rather than incidental by-products of implementation.

2. End-to-end training pipeline in Vulkan compute

VkSplat implements the full 3DGS training loop in Vulkan compute. The supplementary material enumerates the stages explicitly:

Projection forward
Index offset
Generate keys
Sorting
Tile ranges
Rasterization forward
Copy image to device
Loss gradient
Rasterization backward
Projection backward + optimizer
Densification (Chen et al., 30 Apr 2026)

This breakdown is important because it shows that VkSplat is not a Vulkan rasterizer attached to an external ML framework. Projection, sorting and binning, rasterization, loss-gradient computation, backward propagation, the optimizer step, and densification all run within Vulkan compute. The system therefore covers the entire iterative training pipeline: forward render, comparison with the target image via a loss, backpropagation through rasterization and projection, parameter updates, and Gaussian densification as needed.

The rendering strategy follows the standard scalable 3DGS pattern of per-splat key generation, sorting, tile-range construction, and tile-local rasterization. In VkSplat, that tiled rendering path is coupled to backward kernels and optimizer logic in the same compute environment. This suggests that the paper’s systems contribution is the elimination of framework boundaries that ordinarily divide orchestration, rendering, differentiation, and update logic across multiple runtime layers.
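The binning stages named above (index offset, key generation, sorting, tile ranges) can be illustrated with a minimal CPU sketch in Python. This follows the commonly used 3DGS scheme and is not taken from the VkSplat source: the 16-pixel tile size, the 64-bit key layout (tile id in the high bits, float depth bits in the low bits), and the names `bin_splats`, `tile_rect`, and `depth_bits` are all assumptions made for illustration. A GPU implementation would run each stage as a compute dispatch and would typically use a parallel radix sort rather than a comparison sort.

```python
import struct
from itertools import accumulate

TILE = 16  # tile edge in pixels (assumed; a common 3DGS choice)


def depth_bits(depth):
    """Reinterpret a non-negative float32 depth as an order-preserving uint32."""
    return struct.unpack("<I", struct.pack("<f", depth))[0]


def tile_rect(cx, cy, radius, tiles_x, tiles_y):
    """Half-open tile bounds covered by a splat's screen-space bounding square."""
    x0 = min(max(int((cx - radius) // TILE), 0), tiles_x)
    y0 = min(max(int((cy - radius) // TILE), 0), tiles_y)
    x1 = min(max(int((cx + radius) // TILE) + 1, 0), tiles_x)
    y1 = min(max(int((cy + radius) // TILE) + 1, 0), tiles_y)
    return x0, y0, x1, y1


def bin_splats(splats, width, height):
    """splats: list of (cx, cy, radius, depth) tuples in pixel units."""
    tiles_x = (width + TILE - 1) // TILE
    tiles_y = (height + TILE - 1) // TILE
    rects = [tile_rect(cx, cy, r, tiles_x, tiles_y) for cx, cy, r, _ in splats]

    # Index offset: exclusive prefix sum of per-splat tile counts.
    counts = [(x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in rects]
    offsets = [0] + list(accumulate(counts))

    # Generate keys: one (tile id, depth) key per splat-tile overlap.
    keys = [0] * offsets[-1]
    splat_ids = [0] * offsets[-1]
    for i, ((x0, y0, x1, y1), splat) in enumerate(zip(rects, splats)):
        slot = offsets[i]
        for ty in range(y0, y1):
            for tx in range(x0, x1):
                keys[slot] = ((ty * tiles_x + tx) << 32) | depth_bits(splat[3])
                splat_ids[slot] = i
                slot += 1

    # Sorting: orders entries by tile first, then front-to-back within a tile.
    order = sorted(range(len(keys)), key=keys.__getitem__)
    sorted_keys = [keys[j] for j in order]
    sorted_ids = [splat_ids[j] for j in order]

    # Tile ranges: half-open [start, end) span of each tile in the sorted list.
    ranges = {}
    for pos, key in enumerate(sorted_keys):
        tile = key >> 32
        start, _ = ranges.get(tile, (pos, pos))
        ranges[tile] = (start, pos + 1)
    return sorted_ids, ranges


# Three splats on a 32x16 image (two tiles); the third straddles both tiles.
ids, ranges = bin_splats([(8, 8, 4, 2.0), (8, 8, 4, 1.0), (16, 8, 4, 1.5)], 32, 16)
print(ids, ranges)
```

Each tile's rasterization pass then only reads its own slice of `ids`, which is what makes the forward and backward rasterization stages tile-local.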

3. Memory model, buffer management, and densification behavior

A central theme of VkSplat is explicit memory control. Unlike PyTorch, Vulkan requires explicit allocation and tracking, and the implementation uses that requirement to minimize hidden allocator overhead, control memory layout directly, and avoid unnecessary intermediate copies (Chen et al., 30 Apr 2026).

The paper distinguishes two VRAM metrics:

Total VRAM: sum of allocated buffers
Peak VRAM: includes resizing spikes and better reflects whether OOM will occur

This distinction is methodologically important. The paper argues that total allocated memory alone can understate practical memory risk when densification induces short-lived resizing spikes. The supplementary discussion also notes that nvidia-smi sampling at 1 Hz can miss short-lived spikes, so measured process memory and allocator-reported values do not necessarily coincide exactly.
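The sampling caveat can be made concrete with a small Python sketch. The memory trace is entirely hypothetical (a steady 3000 MiB with a single 200 ms spike to 3600 MiB) and is not data from the paper; it only shows why a 1 Hz poller can report a lower peak than the allocator actually reached.

```python
def usage_mib(t_ms):
    """Hypothetical trace: steady 3000 MiB with one 200 ms spike to 3600 MiB."""
    return 3600 if 10_300 <= t_ms < 10_500 else 3000


def sampled_peak(period_ms, duration_ms):
    """Largest value seen when polling the trace every `period_ms` milliseconds."""
    return max(usage_mib(t) for t in range(0, duration_ms, period_ms))


coarse = sampled_peak(1000, 20_000)  # 1 Hz polling, as with periodic nvidia-smi queries
fine = sampled_peak(10, 20_000)      # 100 Hz polling
print(coarse, fine)
```

In this toy trace the 1 Hz poller never lands inside the spike, so it reports the steady value, while the denser poller sees the true maximum.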

Two densification settings are evaluated:

Default densification
MCMC 1M densification

For MCMC 1M densification, the paper states that buffers are pre-allocated to fit the maximum number of Gaussians in order to avoid buffer resizing, and that this produces peak VRAM near-identical to total VRAM. In that setting, average total VRAM is reported as about 0.92–0.93 GiB. The implementation point is straightforward but consequential: pre-allocation converts resizing-driven transient peaks into a more stable memory profile.
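The effect of pre-allocation on peak memory can be sketched with a toy allocator in Python. Everything here is illustrative: `TrackedPool` is a hypothetical helper, the Gaussian counts are made up, and 256 bytes per Gaussian is a placeholder rather than VkSplat's actual layout. The sketch assumes that growing a buffer means allocating the larger buffer while the old one is still live (so its contents can be copied), which is where the transient spike comes from.

```python
class TrackedPool:
    """Toy allocator that records current and peak bytes (hypothetical helper)."""

    def __init__(self):
        self.current = 0
        self.peak = 0

    def alloc(self, nbytes):
        self.current += nbytes
        self.peak = max(self.peak, self.current)

    def free(self, nbytes):
        self.current -= nbytes


def grow_by_resizing(counts, bytes_per_gaussian):
    """Reallocate whenever the Gaussian count exceeds the current capacity."""
    pool = TrackedPool()
    capacity = 0
    for n in counts:
        if n > capacity:
            pool.alloc(n * bytes_per_gaussian)        # new buffer, old one still live
            pool.free(capacity * bytes_per_gaussian)  # released after the copy
            capacity = n
    return pool


def preallocate(counts, bytes_per_gaussian):
    """Allocate once for the maximum count, so no resize ever happens."""
    pool = TrackedPool()
    pool.alloc(max(counts) * bytes_per_gaussian)
    return pool


counts = [100_000, 400_000, 1_000_000]  # illustrative densification schedule
resized = grow_by_resizing(counts, 256)
fixed = preallocate(counts, 256)
print(resized.current, resized.peak)    # total vs. peak with resizing
print(fixed.current, fixed.peak)        # total vs. peak with pre-allocation
```

With these placeholder numbers, the resizing path ends at 256,000,000 bytes but peaks at 358,400,000 bytes, while the pre-allocated path has identical total and peak. Pre-allocation is only possible when the maximum count is known in advance, which is the case for a capped setting such as MCMC 1M.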

For default densification, the reported average values are:

Metric	Reported value
Total VRAM	about 3.05–3.07 GiB
Peak VRAM	about 3.59–3.62 GiB

The paper presents these figures as evidence that the Vulkan implementation reduces VRAM relative to the CUDA/PyTorch baseline. It also reports that both VkSplat and GSplat underreport actual peak usage to some extent in the bicycle-scene comparison, but not in a way that invalidates the overall comparison. That caveat is notable because it frames memory reduction as an empirical systems result rather than a purely nominal accounting artifact.

4. Performance, quality, and cross-vendor execution

VkSplat reports both performance and quality outcomes. The headline claims are a $3.3\times$ speedup, 33% VRAM reduction, quality preservation, and cross-vendor compatibility (Chen et al., 30 Apr 2026). The portability claim is substantiated by experiments on both NVIDIA RTX 3090 and AMD Radeon RX 7800 XT.

The supplementary timing breakdown reports the following average total training times:

Setting	NVIDIA RTX 3090	AMD Radeon RX 7800 XT
Default densification	411.8 s	1113.8 s
MCMC 1M densification	285.4 s	853.1 s
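For orientation, the ratios implied by the table can be computed directly. The snippet below uses only the four totals reported above; the ratios themselves are derived here and are not figures quoted from the paper.

```python
# Average total training times in seconds, copied from the table above.
times_s = {
    "default": {"nvidia": 411.8, "amd": 1113.8},
    "mcmc_1m": {"nvidia": 285.4, "amd": 853.1},
}

# Cross-vendor gap: how much longer the AMD run takes for each setting.
amd_over_nvidia = {k: v["amd"] / v["nvidia"] for k, v in times_s.items()}

# Densification effect: default time relative to MCMC 1M on each GPU.
default_over_mcmc = {
    gpu: times_s["default"][gpu] / times_s["mcmc_1m"][gpu]
    for gpu in ("nvidia", "amd")
}

for setting, ratio in amd_over_nvidia.items():
    print(f"{setting}: AMD / NVIDIA = {ratio:.2f}x")
for gpu, ratio in default_over_mcmc.items():
    print(f"{gpu}: default / MCMC 1M = {ratio:.2f}x")
```

On these totals the RX 7800 XT takes roughly 2.7 to 3.0 times as long as the RTX 3090, and the MCMC 1M setting is roughly 1.3 to 1.4 times faster than default densification on either GPU.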

The reported quality remains close to standard 3DGS baselines. The average metrics are given as approximately:

Metric	Reported range
PSNR	around 29.2 for default densification and 29.39 for MCMC 1M
SSIM	around 0.878–0.881
LPIPS Alex	around 0.124–0.130
LPIPS VGG	around 0.168–0.169

These values are presented as evidence that the performance and memory improvements do not introduce a large quality penalty. The paper therefore treats VkSplat as a systems optimization of 3DGS training rather than a trade-off in which efficiency is bought at the expense of reconstruction fidelity.

The cross-vendor result is not uniform across all stages. On AMD, the paper identifies Copy Image to Device as the largest slowdown, reporting that it is nearly 30× slower on average than on NVIDIA. The proposed explanation is PCIe differences, and asynchronous data transfer is suggested as a future optimization. This is an important caveat: the portability claim is affirmative, but the performance envelope remains sensitive to vendor-specific transfer behavior.
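The general idea behind asynchronous transfer is to stage the next training image while the current iteration is still computing. The following Python sketch shows that overlap pattern with a worker thread and a bounded queue. It is a host-side analogy only, not the paper's proposed design: `prefetch`, `load_image`, and `train_step` are hypothetical names. In a Vulkan implementation, one would typically express the same overlap with staging buffers, a transfer-capable queue, and explicit synchronization.

```python
import queue
import threading


def prefetch(load, indices, depth=2):
    """Yield load(i) in order, staging up to `depth` items ahead on a worker thread."""
    staged = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def worker():
        for i in indices:
            staged.put(load(i))  # blocks while `depth` items are already staged
        staged.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = staged.get()
        if item is done:
            return
        yield item


def load_image(i):
    """Stand-in for the image upload step (hypothetical)."""
    return [i] * 4


def train_step(image):
    """Stand-in for the compute stages of one iteration (hypothetical)."""
    return sum(image)


# While train_step runs on image k, the worker is already loading image k+1.
results = [train_step(img) for img in prefetch(load_image, range(5))]
print(results)
```

The bounded queue caps how many staged images exist at once, so the overlap does not reintroduce unbounded memory growth.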

5. Novelty as a training-system result

The paper’s main novelty claim is narrower and more specific than a general claim about Vulkan rendering. What it presents as new is a modern GPU-accelerated 3D reconstruction/training system built entirely on a graphics API designed for portability and explicit control (Chen et al., 30 Apr 2026).

In that sense, VkSplat is defined by the removal of dependencies rather than by a new reconstruction objective. The supplied material does not include an explicit loss formula or update equation. Instead, the important mathematical and algorithmic content is system-level: forward render, loss-gradient computation, rasterization backward, projection backward, optimizer update, and densification. The paper’s emphasis falls on the fact that all of these operations are implemented in Vulkan compute rather than split across PyTorch and CUDA.

The paper therefore attributes the observed speed and memory gains to a combination of:

Explicit buffer management
Pre-allocation for MCMC densification
Tiled rendering with sorting and tile ranges
Vulkan compute kernels for both forward and backward passes
Cross-vendor execution

This framing matters because it distinguishes VkSplat from work that uses Vulkan only as a rendering substrate. The full training workflow is the unit of contribution. A plausible implication is that VkSplat broadens the implementation design space for performance-sensitive ML graphics workloads that have previously been assumed to require CUDA-bound training stacks.

6. Position within the Gaussian-splatting literature and naming ambiguities

Within the broader Gaussian-splatting literature, VkSplat is a training-systems paper rather than a segmentation, language, or unconstrained-reconstruction method. This distinction is useful because the supplied literature contains several splatting systems with related names but different objectives.

A separate paper, "Segment then Splat: A Unified Approach for 3D Open-Vocabulary Segmentation based on Gaussian Splatting", is also described in the supplied material as “VkSplat, or Segment then Splat,” but it addresses a different problem: assigning Gaussians to object-specific sets before reconstruction so that segmentation is embedded in the 3D representation itself (Lu et al., 28 Mar 2025). That work focuses on open-vocabulary 3D segmentation in static and dynamic scenes, not on Vulkan-based training infrastructure.

Likewise, "SplatTalk: 3D VQA with Gaussian Splatting" uses generalizable 3DGS to produce 3D tokens suitable for direct input into a pretrained LLM, thereby targeting zero-shot 3D visual question answering from posed RGB images only (Thai et al., 8 Mar 2025). "CrashSplat: 2D to 3D Vehicle Damage Segmentation in Gaussian Splatting" is a learning-free, single-view 2D-to-3D lifting strategy for vehicle damage segmentation (Chileban et al., 28 Sep 2025). "WildSplatter: Feed-forward 3D Gaussian Splatting with Appearance Control from Unconstrained Images" is a pose-free, feed-forward 3DGS model for unconstrained images with unknown camera parameters and varying lighting conditions (Fujimura et al., 23 Apr 2026).

Against that backdrop, VkSplat is best understood as a paper about how 3DGS training is executed, not about attaching language to Gaussians, segmenting objects, or reconstructing from unconstrained photo collections. Its specific contribution is to demonstrate that 3DGS training can be cross-vendor, memory-conscious, and fast within a fully Vulkan-based compute pipeline, while maintaining quality (Chen et al., 30 Apr 2026).