
Instanced Software Rasterizer

Updated 25 November 2025
  • Instanced Software Rasterizer is a GPU-accelerated rendering pipeline that uses a neural MLP for view-dependent occlusion culling on 3D Gaussian primitives.
  • It integrates efficient frustum culling, per-pixel compositing, and tile-based binning to optimize real-time image synthesis in large-scale, complex scenes.
  • The architecture achieves significant VRAM savings and improved frame rates by selectively processing massive instance-level duplications with learned visibility.

An instanced software rasterizer, as implemented in "NVGS: Neural Visibility for Occlusion Culling in 3D Gaussian Splatting," is a GPU-accelerated rendering pipeline designed for real-time image synthesis of scenes containing large numbers of 3D Gaussian Splatting (3DGS) primitives, particularly in scenarios where massive instance-level duplication (composition) is required. This architecture enables efficient frustum and occlusion culling, per-pixel compositing, and level-of-detail (LoD) selection for high-fidelity, memory-efficient scene rendering. The defining feature is the integration of a neural visibility multilayer perceptron (MLP), invoked per instance per Gaussian to perform learned, view-dependent occlusion culling directly on Tensor Cores (Zoomers et al., 24 Nov 2025).

1. Scene Organization and Data Flow

A scene is defined as a composition of $A$ distinct 3DGS assets (such as avatars, vegetation, or artifacts), each instantiated $M_a$ times via affine transforms $T_a[k]$, $k = 1 \dots M_a$. Each asset $a$ contains a set of $N_a$ Gaussians, parameterized by mean $\mu \in \mathbb{R}^3$, covariance $\Sigma \in \mathbb{R}^{2 \times 2}$, color, and opacity. Two neural network structures are associated per asset:

  • A visibility MLP $f_{a,\text{vis}}$ (shared by all instances of $a$)
  • A secondary MLP producing a 6D per-Gaussian feature array $\text{Feat}[a][i] \in \mathbb{R}^6$
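A minimal sketch of this per-asset organization, with type and field names that are illustrative rather than taken from the paper:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical per-asset layout: N_a Gaussians plus M_a instance transforms.
struct GaussianSoA {
    std::vector<std::array<float, 3>> mean;    // mu in asset-local space
    std::vector<std::array<float, 4>> cov2d;   // projected 2x2 covariance, row-major
    std::vector<std::array<float, 4>> color;   // RGB + opacity
    std::vector<std::array<float, 6>> feat;    // precomputed 6D neural feature
};

struct Asset {
    GaussianSoA gaussians;                         // N_a Gaussians
    std::vector<std::array<float, 16>> instanceT;  // M_a affine transforms T_a[k] (4x4)
    // The ~18 kB of f_vis MLP weights would also live here, one copy per asset.
};

// Total Gaussian instances the rasterizer may touch: sum over a of N_a * M_a.
std::size_t totalInstancedGaussians(const std::vector<Asset>& scene) {
    std::size_t n = 0;
    for (const auto& a : scene)
        n += a.gaussians.mean.size() * a.instanceT.size();
    return n;
}
```

Note that the visibility MLP is stored once per asset, not per instance, which is what makes heavy instancing cheap.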

Preprocessing removes low-opacity Gaussians and recenters assets. Visibility labels are extracted for each Gaussian from dense multi-view renders and used to train both the per-Gaussian feature MLP and the primary asset-specific visibility MLP $f_{a,\text{vis}}$, which maps a 16-dimensional vector (comprising local-space Gaussian statistics, viewing direction, normalized distance, camera forward vector, and the 6D precomputed feature) onto a binary visibility indicator $v \in \{0,1\}$.
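The 16-dimensional input decomposes as 3 (mean) + 3 (direction) + 3 (camera forward) + 1 (distance) + 6 (features). A packing sketch follows; the component ordering is an assumption, since the paper specifies the inputs but not their layout:

```cpp
#include <array>

// Pack the five input components into the fixed 16-float MLP input.
// Ordering (mu, dir, camFwd, dNorm, feat) is illustrative.
std::array<float, 16> packVisInput(const std::array<float, 3>& muLocal,
                                   const std::array<float, 3>& dirLocal,
                                   const std::array<float, 3>& camFwdLocal,
                                   float dNorm,
                                   const std::array<float, 6>& feat) {
    std::array<float, 16> in{};
    int o = 0;
    for (float v : muLocal)     in[o++] = v;  // slots 0..2
    for (float v : dirLocal)    in[o++] = v;  // slots 3..5
    for (float v : camFwdLocal) in[o++] = v;  // slots 6..8
    in[o++] = dNorm;                          // slot 9
    for (float v : feat)        in[o++] = v;  // slots 10..15
    return in;
}
```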

At runtime, all Gaussian and per-Gaussian features are uploaded to the GPU. Each frame, the pipeline evaluates, per instance:

  • Frustum culling on each Gaussian’s mean in world space.
  • Neural occlusion culling by batch-evaluating $f_{a,\text{vis}}$ for frustum-surviving Gaussians using Tensor Core–optimized MLP forward passes.
  • Surviving Gaussians are gathered into a global "activeList" and binned into screen-space tiles for parallel tile-based raster workgroups.
  • Final splatting and shading proceed per tile, compositing each eligible Gaussian into the frame buffer.

2. Algorithms and Data Structures

A strict structure-of-arrays (SoA) layout is employed: positions, covariances, colors, and 6D Gaussian features occupy contiguous memory, optimizing GPU access. Indices are used for the dynamic activeList and per-tile tileList.
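One common way to keep such dynamic index lists compact on the GPU is to bit-pack each (asset, instance, gaussian) triplet into a single word. The field widths below are illustrative, not the paper's encoding:

```cpp
#include <cstdint>

// Pack an activeList entry (a, k, g) into one 64-bit word:
// 12 bits asset | 26 bits instance | 26 bits gaussian (assumed widths).
uint64_t packEntry(uint32_t asset, uint32_t instance, uint32_t gaussian) {
    return (uint64_t(asset) << 52) | (uint64_t(instance) << 26) | gaussian;
}

void unpackEntry(uint64_t e, uint32_t& asset, uint32_t& instance, uint32_t& gaussian) {
    asset    = uint32_t(e >> 52);
    instance = uint32_t(e >> 26) & ((1u << 26) - 1);
    gaussian = uint32_t(e)       & ((1u << 26) - 1);
}
```

Packed entries keep the activeList and tileLists at 8 bytes per visible Gaussian, which matters when $K$ reaches millions per frame.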

Frustum culling uses per-Gaussian dot-product tests against precomputed frustum planes. Tile assignment is performed using the 2D axis-aligned bounding box (AABB) of the $1\sigma$ projected extent, mapping to integer tile indices; the corresponding Gaussian indices are appended to each tileList.
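A minimal sketch of this tile-assignment step, assuming a 16-pixel tile size (the paper does not state the tile dimensions here):

```cpp
#include <algorithm>
#include <vector>

constexpr int TILE = 16;  // assumed tile edge length in pixels

// Map a projected 1-sigma AABB (in pixel coordinates) to the integer
// indices of the screen tiles it covers.
std::vector<int> tilesCovered(float minX, float minY, float maxX, float maxY,
                              int tilesX, int tilesY) {
    int tx0 = std::clamp(int(minX) / TILE, 0, tilesX - 1);
    int ty0 = std::clamp(int(minY) / TILE, 0, tilesY - 1);
    int tx1 = std::clamp(int(maxX) / TILE, 0, tilesX - 1);
    int ty1 = std::clamp(int(maxY) / TILE, 0, tilesY - 1);
    std::vector<int> tiles;
    for (int ty = ty0; ty <= ty1; ++ty)
        for (int tx = tx0; tx <= tx1; ++tx)
            tiles.push_back(ty * tilesX + tx);
    return tiles;
}
```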

Splat rendering adheres to the standard 3DGS "splatting" pipeline: for each covered pixel $p$, compute the Gaussian footprint weight $w = \exp\left(-\frac{1}{2}(p - p_0)^\top \Sigma^{-1}(p - p_0)\right)$, apply spherical-harmonic shading, and blend via front-to-back compositing.
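An illustrative CPU-side sketch of this per-pixel evaluation; the function names and single scalar color channel are simplifications, not the paper's implementation:

```cpp
#include <cmath>

// Gaussian footprint weight from the symmetric 2x2 inverse covariance
// Sigma^{-1} = [invA invB; invB invC], evaluated at pixel offset (dx, dy):
// w = exp(-0.5 * d^T Sigma^{-1} d).
float gaussianWeight(float dx, float dy, float invA, float invB, float invC) {
    float q = invA * dx * dx + 2.0f * invB * dx * dy + invC * dy * dy;
    return std::exp(-0.5f * q);
}

// Front-to-back compositing: C += T * alpha * c, then T *= (1 - alpha).
void compositeFrontToBack(float color, float alpha,
                          float& accumColor, float& transmittance) {
    accumColor    += transmittance * alpha * color;
    transmittance *= (1.0f - alpha);
}
```

Front-to-back order allows early termination once the transmittance drops near zero, which is the standard 3DGS optimization.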

Overview of Pipeline Data Structures

| Data Structure | Purpose | Storage Unit |
|---|---|---|
| positions[$N$] | Mean of each Gaussian | $\mathbb{R}^3 \times N$ |
| covariances[$N$] | 2D covariance per Gaussian | $\mathbb{R}^{2 \times 2} \times N$ |
| Feat[$N$][6] | Per-Gaussian neural features | $6 \times N$ floats |
| activeList | Dynamic indices of visible Gaussians | $\leq K$ triplets $(a, k, g)$ |
| tileList[$T$][*] | Per-tile lists of binned Gaussians | varies per tile, total $O(K)$ |

3. Neural Occlusion Culling Integration

The neural occlusion-culling MLP is evaluated on Tensor Cores, immediately following frustum culling but preceding any subsequent binning or compositing. Its input for each Gaussian includes:

  • Local-space mean position $\mu_{ls}$
  • Local viewing direction $\text{dir}_{ls}$
  • Local camera forward vector
  • Normalized distance to camera $d_{\text{norm}}$
  • Per-Gaussian precomputed 6D feature vector $\text{Feat}[a][g_{\text{index}}]$

These are packed into a 16-float input and batch-processed for tens of thousands of Gaussians; thresholding the output of $f_{a,\text{vis}}$ at $0.5$ determines final inclusion in the activeList. Only Gaussians passing this neural visibility test are retained for further rasterization. Tensor Core efficiency is maximized by using fixed 16-dimensional inputs and hidden layers sized as multiples of 16.
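The batching scheme can be sketched as follows. The column-wise $16 \times B$ packing, the batch width of 16, and the MLP callback are assumptions for illustration; on the GPU the forward pass would run on Tensor Cores rather than through a callback:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Batch 16-float inputs into zero-padded 16 x B matrices (one input per
// column), run the MLP forward pass, and keep indices whose visibility
// score exceeds 0.5. MLP is any callable (matrix, batchWidth) -> scores.
template <typename MLP>
std::vector<int> neuralCull(const std::vector<std::array<float, 16>>& inputs,
                            MLP forward) {
    const int B = 16;  // assumed batch width
    std::vector<int> kept;
    for (std::size_t base = 0; base < inputs.size(); base += B) {
        std::size_t n = std::min<std::size_t>(B, inputs.size() - base);
        std::vector<float> batch(16 * B, 0.0f);          // 16 x B, zero-padded
        for (std::size_t j = 0; j < n; ++j)
            for (int r = 0; r < 16; ++r)
                batch[r * B + j] = inputs[base + j][r];  // column j = input j
        std::vector<float> out = forward(batch, B);      // B visibility scores
        for (std::size_t j = 0; j < n; ++j)
            if (out[j] > 0.5f) kept.push_back(int(base + j));
    }
    return kept;
}
```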

4. Full Rendering Pipeline Execution

The instanced rasterizer proceeds via the following frame-loop pseudocode:

FrameBuffer.clearToBackground();
activeList.clear();

// 1) Frustum cull and neural occlusion cull
for (each asset a)
  for (k = 0; k < M_a; ++k) {
    mat4 invT   = invTransforms[a][k];
    mat3 invRot = mat3(invT);
    for (g = 0; g < N[a]; ++g) {
      vec3 mu_ws = vec3(transforms[a][k] * vec4(μ[a][g], 1.0));
      if (!inFrustum(mu_ws)) continue;
      vec3 mu_ls   = vec3(invT * vec4(mu_ws, 1.0));
      vec3 dir_ws  = normalize(camPos - mu_ws);
      vec3 dir_ls  = invRot * dir_ws;
      float dist_ws = length(camPos - mu_ws);
      float dist_ls = dist_ws * (f_t[a] / f_r) * (1.0 / s[a]);  // rescale into local/training units
      float d_norm  = (dist_ls - d_min[a]) / (d_max[a] - d_min[a]) * 2.0 - 1.0;
      // Pack input[16]: mu_ls, dir_ls, local camera forward, d_norm, Feat[a][g]
      if (f_vis[a].forward(input) <= 0.5) continue;
      activeList.append({a, k, g});
    }
  }

// 2) Tile-based binning
for (each entry e in activeList) {
  ProjectionResult pr = projectAndBBox(μ[e.a][e.g], Σ[e.a][e.g]);
  for (t in tiles_covered_by(pr.bbox))
    tileList[t].append(e);
}

// 3) Rasterization
parallel_for (each tile t) {
  for (e : tileList[t]) {
    for (each pixel p in t covered by e) {
      weight = gaussianWeight(p, e.Σ_ws);
      shade  = evalSH(e.color, dir_ws(p));   // view-dependent SH shading
      FrameBuffer[p] = composite(FrameBuffer[p], shade, weight);
    }
  }
}

// 4) Display
display(FrameBuffer);

Tensor Core acceleration is used exclusively within the batch-forward pass of the $f_{a,\text{vis}}$ MLP, by packing many $16 \times 1$ input vectors into $16 \times 16$ or $16 \times 64$ matrices (TCNN convention).

5. Performance and Scaling

Let $N$ denote the total number of Gaussians, $M$ the number passing frustum culling, $K$ the number selected by the MLP for rendering, and $P$ the pixel count. The per-frame time complexity is

$$T(N, M, K) = O(N)\ \text{[frustum tests]} + O(M)\ \text{[MLP evals]} + O\!\left(K + \textstyle\sum \text{tile work}\right)\ \text{[splat/composite]} = O(N + M + K)$$

Memory consumption comprises:

  • Gaussian storage: $\approx N \times 32$ B
  • Feature vectors: $N \times 24$ B
  • MLP weights: $\approx 18$ kB per asset
  • Tile lists and frame buffer: $O(K + P)$

In aggregate, $(32+24)N$ B plus tile and frame buffer space. Empirically, for $N \approx 60$ M Gaussians per scene and $K \approx 1$–$5$ M rendered per frame, measured throughput on an NVIDIA RTX 3090 Ti (24 GB) yields:

  • Frustum-only pipeline (no MLP): 60 FPS at 1080p
  • With MLP-based culling: 50–55 FPS (an approximately 10 FPS gain over a no-culling baseline)
  • Full instanced pipeline VRAM usage: 4 GB, reduced by ~25% compared to absence of MLP-based occlusion culling (Zoomers et al., 24 Nov 2025)

6. Architectural Optimizations

Multiple orthogonal and complementary optimizations are implemented:

  • Frustum culling: Eliminates 60–90% of Gaussians before MLP evaluation (ensuring $M \ll N$).
  • Tensor Core–friendly MLP: 16-dimensional fixed input and hidden layer multiples ensure near-maximal Tensor Core utilization.
  • Precomputed features: 6D features reduce runtime encoding and avoid costly online trigonometric computations.
  • Tile-based binning: Spatially local tileLists enhance memory access patterns and mitigate overdraw.
  • Distance-based radius clipping: Optionally eliminates distant Gaussians, trading minor perceptual error for throughput.
  • Level-of-detail integration: Enables LoD subset selection before entering the main pipeline. The MLP-based occlusion culling and instanced rasterization are orthogonal to LoD selection.

This configuration enables real-time, VRAM-efficient instanced rasterization that never instantiates occluded or out-of-frustum Gaussians, integrates neural visibility queries directly into the frame loop, and scales to $>100$ M primitives in composed scenes (Zoomers et al., 24 Nov 2025).

7. Significance and Context

The described instanced software rasterizer addresses the principal challenge in 3DGS compositional rendering: the high cost of per-frame occlusion culling for semi-transparent primitives. By designing a visibility MLP with minimal memory footprint ($\sim$18 kB per asset) that is amenable to Tensor Core acceleration, the approach integrates learned occlusion culling into the main loop of the rasterization pipeline. The architecture demonstrates VRAM savings, improved frame rates, and better scalability than non-MLP-based methods. A plausible implication is that similar learned per-instance culling could be adapted to other parametric primitive representations in neural or hybrid graphics pipelines, especially where conventional hardware occlusion culling is ineffective.

References (1)

  1. Zoomers et al., "NVGS: Neural Visibility for Occlusion Culling in 3D Gaussian Splatting," 24 Nov 2025.