Instanced Software Rasterizer
- Instanced Software Rasterizer is a GPU-accelerated rendering pipeline that uses a neural MLP for view-dependent occlusion culling on 3D Gaussian primitives.
- It integrates efficient frustum culling, per-pixel compositing, and tile-based binning to optimize real-time image synthesis in large-scale, complex scenes.
- The architecture achieves significant VRAM savings and improved frame rates by using learned visibility to render only the relevant subset of massively instanced scenes.
An instanced software rasterizer, as implemented in "NVGS: Neural Visibility for Occlusion Culling in 3D Gaussian Splatting," is a GPU-accelerated rendering pipeline designed for real-time image synthesis of scenes containing large numbers of 3D Gaussian Splatting (3DGS) primitives, particularly in scenarios where massive instance-level duplication (composition) is required. This architecture enables efficient frustum and occlusion culling, per-pixel compositing, and level-of-detail (LoD) selection for high-fidelity, memory-efficient scene rendering. The defining feature is the integration of a neural visibility multilayer perceptron (MLP), invoked per instance and per Gaussian to perform learned, view-dependent occlusion culling directly on Tensor Cores (Zoomers et al., 24 Nov 2025).
1. Scene Organization and Data Flow
A scene is defined as a composition of distinct 3DGS assets $a$ (such as avatars, vegetation, or artifacts), each instantiated $M_a$ times via affine transforms $T_{a,k}$. Each asset contains a set of $N_a$ Gaussians, parameterized by mean $\mu$, covariance $\Sigma$, color, and opacity. Two neural network structures are associated with each asset:
- A visibility MLP $f^{\text{vis}}_a$ (shared by all instances of $a$)
- A secondary MLP producing a 6D per-Gaussian feature array
Preprocessing removes low-opacity Gaussians and recenters assets. Visibility labels are extracted for each Gaussian from dense multi-view renders and used to train both the per-Gaussian feature MLP and the primary asset-specific visibility MLP $f^{\text{vis}}_a$, which maps a 16-dimensional vector—comprising local-space Gaussian statistics, viewing direction, normalized distance, camera forward vector, and the 6D precomputed feature—onto a binary visibility indicator $v \in \{0,1\}$.
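A minimal C++ sketch of this per-asset organization follows; names such as `Asset`, `VisibilityMLP`, and the dense weight layout are illustrative assumptions, not structures taken from the paper:

```cpp
#include <array>
#include <cstdint>
#include <vector>

// 16 -> hidden -> 1 network shared by every instance of one asset; the
// weight layout here is a placeholder for the Tensor Core-friendly format.
struct VisibilityMLP {
    std::vector<float> weights;
    float forward(const std::array<float, 16>& input) const;  // returns v in [0, 1]
};

struct Asset {
    uint32_t firstGaussian = 0;  // range into the global per-Gaussian buffers
    uint32_t numGaussians  = 0;  // N_a
    std::vector<std::array<float, 16>> instanceTransforms;  // M_a affine 4x4 matrices T_{a,k}
    VisibilityMLP visMLP;        // f_vis_a, shared by all M_a instances
};
```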
At runtime, all Gaussian and per-Gaussian features are uploaded to the GPU. Each frame, the pipeline evaluates, per instance:
- Frustum culling on each Gaussian’s mean in world space.
- Neural occlusion culling by batch-evaluating $f^{\text{vis}}_a$ for frustum-surviving Gaussians using Tensor Core–optimized MLP forward passes.
- Surviving Gaussians are gathered into a global "activeList" and binned into screen-space tiles for parallel tile-based raster workgroups.
- Final splatting and shading proceed per tile, compositing each eligible Gaussian into the frame buffer.
2. Algorithms and Data Structures
A strict structure-of-arrays (SoA) layout is employed: positions, covariances, colors, and 6D Gaussian features occupy contiguous memory, optimizing GPU access patterns. Index lists are used for the dynamic activeList and the per-tile tileLists.
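A sketch of this layout in C++ host code, assuming all assets' Gaussians are concatenated into global arrays; array names follow the table below:

```cpp
#include <cstdint>
#include <vector>

// Structure-of-arrays scene buffers: each attribute lives in its own
// contiguous array so GPU threads read coalesced memory.
struct SceneBuffers {
    std::vector<float> positions;    // 3 floats per Gaussian (x, y, z)
    std::vector<float> covariances;  // packed covariance entries per Gaussian
    std::vector<float> colors;       // color and opacity per Gaussian
    std::vector<float> features;     // 6 floats per Gaussian (neural features)

    // Dynamic per-frame index lists.
    struct ActiveEntry { uint32_t asset, instance, gaussian; };  // {a, k, g}
    std::vector<ActiveEntry> activeList;              // visibility survivors
    std::vector<std::vector<ActiveEntry>> tileLists;  // one list per screen tile
};
```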
Frustum culling uses per-Gaussian dot-product tests against precomputed frustum planes. Tile assignment uses the 2D axis-aligned bounding box (AABB) of each Gaussian's projected extent, mapped to integer tile indices; surviving Gaussian indices are appended to the corresponding tileList.
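A minimal sketch of these two tests, assuming inward-facing plane normals and a 16×16-pixel tile size (both assumptions, not specified here):

```cpp
#include <algorithm>
#include <cmath>

struct Vec4 { float x, y, z, w; };

// Signed-distance test of a world-space point against six frustum planes,
// each stored as (nx, ny, nz, d) with normals pointing into the frustum.
bool inFrustum(const Vec4 planes[6], float px, float py, float pz) {
    for (int i = 0; i < 6; ++i) {
        const Vec4& pl = planes[i];
        if (pl.x * px + pl.y * py + pl.z * pz + pl.w < 0.0f)
            return false;  // outside this plane
    }
    return true;
}

// Map a projected screen-space AABB to the integer tile range it covers.
// Callers should treat tx1 < tx0 (or ty1 < ty0) as "no tiles covered".
constexpr int kTileSize = 16;  // pixels per tile edge (assumed)
void tilesCovered(float minX, float minY, float maxX, float maxY,
                  int tilesX, int tilesY,
                  int& tx0, int& ty0, int& tx1, int& ty1) {
    tx0 = std::max(0, (int)std::floor(minX / kTileSize));
    ty0 = std::max(0, (int)std::floor(minY / kTileSize));
    tx1 = std::min(tilesX - 1, (int)std::floor(maxX / kTileSize));
    ty1 = std::min(tilesY - 1, (int)std::floor(maxY / kTileSize));
}
```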
Splat rendering adheres to the standard 3DGS "splatting" pipeline: for each covered pixel $p$, compute the Gaussian footprint weight $w(p) = \exp\!\big(-\tfrac{1}{2}(p-\mu')^{\top}\Sigma'^{-1}(p-\mu')\big)$, where $\mu'$ and $\Sigma'$ are the projected 2D mean and covariance, apply spherical-harmonic shading, and blend via front-to-back compositing.
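The per-pixel inner loop can be sketched as follows, assuming the projected 2×2 covariance has already been inverted; `gaussianWeight` and `compositeFrontToBack` are illustrative helpers, not the paper's API:

```cpp
#include <cmath>

// 2D Gaussian footprint weight at pixel (px, py), given the projected mean
// (mx, my) and the inverse of the 2x2 screen-space covariance (a b; b c).
float gaussianWeight(float px, float py, float mx, float my,
                     float invA, float invB, float invC) {
    float dx = px - mx, dy = py - my;
    float power = -0.5f * (invA * dx * dx + 2.0f * invB * dx * dy + invC * dy * dy);
    return std::exp(power);
}

// Front-to-back compositing: accumulate color weighted by the remaining
// transmittance, then attenuate the transmittance by this splat's alpha.
void compositeFrontToBack(float rgb[3], const float shade[3],
                          float alpha, float& transmittance) {
    for (int i = 0; i < 3; ++i)
        rgb[i] += transmittance * alpha * shade[i];
    transmittance *= (1.0f - alpha);
}
```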
Overview of Pipeline Data Structures
| Data Structure | Purpose | Storage Unit |
|---|---|---|
| positions[] | Mean of each Gaussian | 3 floats per Gaussian |
| covariances[] | 2D covariance per Gaussian | 3 floats per Gaussian (symmetric $2\times2$) |
| Feat[][6] | Per-Gaussian neural features | 6 floats per Gaussian |
| activeList | Dynamic indices of visible Gaussians | {a, k, g} triplets |
| tileList[T][*] | Per-tile lists for binned Gaussians | Varies per tile |
3. Neural Occlusion Culling Integration
The neural occlusion-culling MLP is evaluated on Tensor Cores, immediately following frustum culling but preceding any subsequent binning or compositing. Its input for each Gaussian includes:
- Local-space mean position (3 floats)
- Local viewing direction (3 floats)
- Local camera forward vector (3 floats)
- Normalized distance to camera (1 float)
- Precomputed per-Gaussian 6D feature vector (6 floats)
These are packed into a 16-float input and batch-processed for tens of thousands of Gaussians; thresholding the MLP output at $0.5$ determines final inclusion in the activeList. Only Gaussians passing this neural visibility test are retained for further rasterization. Tensor Core efficiency is maximized by using fixed 16-dimensional inputs and hidden layers sized as multiples of 16.
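A sketch of the input packing, with the field order (mean, direction, forward, distance, feature) assumed rather than specified ($3 + 3 + 3 + 1 + 6 = 16$):

```cpp
#include <array>

std::array<float, 16> packVisibilityInput(
    const float muLocal[3],      // local-space Gaussian mean
    const float dirLocal[3],     // local-space viewing direction
    const float camFwdLocal[3],  // local-space camera forward vector
    float dNorm,                 // distance normalized to [-1, 1]
    const float feat[6])         // precomputed per-Gaussian feature
{
    std::array<float, 16> in{};
    int j = 0;
    for (int i = 0; i < 3; ++i) in[j++] = muLocal[i];
    for (int i = 0; i < 3; ++i) in[j++] = dirLocal[i];
    for (int i = 0; i < 3; ++i) in[j++] = camFwdLocal[i];
    in[j++] = dNorm;
    for (int i = 0; i < 6; ++i) in[j++] = feat[i];
    return in;  // the Gaussian survives if visMLP.forward(in) > 0.5
}
```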
4. Full Rendering Pipeline Execution
The instanced rasterizer proceeds via the following frame-loop pseudocode:
```
FrameBuffer.clearToBackground();
activeList.clear();

// 1) Frustum cull and neural occlusion cull
for (each asset a)
    for (k = 0; k < M_a; ++k) {
        mat4 invT   = invTransforms[a][k];
        mat3 invRot = mat3(invT);
        for (g = 0; g < N[a]; ++g) {
            // World-space mean and frustum test
            vec3 mu_ws = transforms[a][k] * μ[a][g];
            if (!inFrustum(mu_ws)) continue;

            // Local-space quantities for the visibility MLP
            vec3  mu_ls   = vec3(invT * vec4(mu_ws, 1.0));
            vec3  dir_ws  = normalize(camPos - mu_ws);
            vec3  dir_ls  = invRot * dir_ws;
            float dist_ws = length(camPos - mu_ws);
            float dist_ls = dist_ws * (f_t[a] / f_r) * (1.0 / s[a]);
            float d_norm  = (dist_ls - d_min[a]) / (d_max[a] - d_min[a]) * 2.0 - 1.0;

            // Pack input[16]: mu_ls, dir_ls, local camera forward, d_norm, Feat[a][g]
            if (f_vis_a.forward(input) <= 0.5) continue;  // neural visibility test
            activeList.append({a, k, g});
        }
    }

// 2) Tile-based binning
for (each entry e in activeList) {
    ProjectionResult pr = projectAndBBox(μ[e.a][e.g], Σ[e.a][e.g]);
    for (t ∈ tiles_covered_by(pr.bbox))
        tileList[t].append(e);
}

// 3) Rasterization
parallel_for (each tile t) {
    for (e : tileList[t]) {
        for (each covered pixel p) {
            weight = gaussianWeight(p, e.Σ_2D);
            shade  = evalSH(e.color, viewDir(p));  // view-dependent SH shading
            FrameBuffer[p] = composite(FrameBuffer[p], shade * weight);
        }
    }
}

// 4) Display
display(FrameBuffer);
```
Tensor Core acceleration is used exclusively within the batch-forward pass of the MLP, by packing many 16-float input vectors into matrices whose dimensions are multiples of 16 (TCNN convention).
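As a CPU reference for what the Tensor Core path computes, one layer of the batched forward pass reduces to a single matrix product over column-packed inputs; the ReLU activation here is an assumption:

```cpp
#include <algorithm>
#include <vector>

// W: (out x 16) row-major weights; X: (16 x B) column-packed input batch;
// Y: (out x B) output batch. One GEMM evaluates the whole batch, which is
// exactly the shape Tensor Cores consume efficiently.
void denseLayerBatch(const std::vector<float>& W, int out,
                     const std::vector<float>& X, int B,
                     std::vector<float>& Y) {
    Y.assign((size_t)out * B, 0.0f);
    for (int o = 0; o < out; ++o)
        for (int k = 0; k < 16; ++k) {
            float w = W[(size_t)o * 16 + k];
            for (int b = 0; b < B; ++b)
                Y[(size_t)o * B + b] += w * X[(size_t)k * B + b];
        }
    for (float& y : Y) y = std::max(y, 0.0f);  // assumed ReLU activation
}
```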
5. Performance and Scaling
Let $N$ denote the total number of Gaussians, $N_F$ the number passing frustum culling, $N_V$ those selected by the MLP for rendering, and $P$ the pixel count. Time complexity per frame is then

$$O(N) + O(N_F \cdot C_{\text{MLP}}) + O(N_V \cdot \bar{p}) + O(P),$$

where $C_{\text{MLP}}$ is the cost of one MLP forward pass and $\bar{p}$ is the average number of pixels covered per rendered splat.
Memory consumption comprises:
- Gaussian storage: $O(N)$ B for means, covariances, colors, and opacities
- Feature vectors: $24N$ B (6 floats per Gaussian)
- MLP weights: ~18 kB per asset
- Tile lists and frame buffer: proportional to $N_V$ and the render resolution
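A back-of-the-envelope sketch of this budget; the 48 B per-Gaussian attribute size is an assumption, while the 24 B feature size and ~18 kB MLP size follow from the text:

```cpp
#include <cstdint>

uint64_t vramBytes(uint64_t numGaussians, uint64_t numAssets) {
    const uint64_t gaussianBytes = 48;         // assumed: mean + covariance + color + opacity
    const uint64_t featureBytes  = 6 * 4;      // 6 floats per Gaussian (from the text)
    const uint64_t mlpBytes      = 18 * 1024;  // ~18 kB per asset (from the text)
    return numGaussians * (gaussianBytes + featureBytes) + numAssets * mlpBytes;
}
```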
In aggregate, this is $O(N)$ B plus tile-list and frame-buffer space. Empirically, for large composed scenes in which only a fraction of the Gaussians is rendered per frame, measured throughput on an NVIDIA RTX 3090 Ti (24 GB) yields:
- Frustum-only pipeline (no MLP): 60 FPS at 1080p
- With MLP-based culling: 50–55 FPS (an approximate 10 FPS gain over a no-culling baseline)
- Full instanced pipeline VRAM usage: 4 GB, roughly 25% lower than without MLP-based occlusion culling (Zoomers et al., 24 Nov 2025)
6. Architectural Optimizations
Multiple orthogonal and complementary optimizations are implemented:
- Frustum culling: Eliminates 60–90% of Gaussians before MLP evaluation (so $N_F \ll N$).
- Tensor Core–friendly MLP: 16-dimensional fixed input and hidden layer multiples ensure near-maximal Tensor Core utilization.
- Precomputed features: 6D features reduce runtime encoding and avoid costly online trigonometric computations.
- Tile-based binning: Spatially local tileLists enhance memory access patterns and mitigate overdraw.
- Distance-based radius clipping: Optionally eliminates distant Gaussians, trading minor perceptual error for throughput (see the sketch after this list).
- Level-of-detail integration: Enables LoD subset selection before entering the main pipeline. The MLP-based occlusion culling and instanced rasterization are orthogonal to LoD selection.
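A minimal sketch of the distance-based radius clip, assuming the criterion is a projected-footprint threshold under a pinhole model; `kMinPixels` is an illustrative tuning parameter, not a value from the paper:

```cpp
// Returns false (cull) when a Gaussian's projected footprint would be
// smaller than kMinPixels on screen.
bool survivesRadiusClip(float worldRadius, float distToCamera,
                        float focalPixels, float kMinPixels = 0.5f) {
    // Approximate projected radius in pixels via a pinhole camera model.
    float projectedPixels = worldRadius * focalPixels / distToCamera;
    return projectedPixels >= kMinPixels;
}
```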
This configuration enables real-time, VRAM-efficient instanced rasterization: occluded and out-of-frustum Gaussians are never instantiated, neural visibility queries are integrated directly into the frame loop, and the pipeline scales to very large numbers of primitives in composed scenes (Zoomers et al., 24 Nov 2025).
7. Significance and Context
The described instanced software rasterizer addresses the principal challenge in 3DGS compositional rendering: the high cost of per-frame occlusion culling for semi-transparent primitives. By designing a visibility MLP with a minimal memory footprint (18 kB per asset), amenable to Tensor Core acceleration, the approach integrates learned occlusion culling directly into the rasterization pipeline. The architecture demonstrates VRAM savings, improved frame rates, and scalability compared to non-MLP-based methods. A plausible implication is that similar learned per-instance culling could be adapted to other parametric primitive representations in neural or hybrid graphics pipelines, especially where conventional hardware occlusion culling is ineffective.