CuRast: Cuda-Based Software Rasterization for Billions of Triangles

Published 23 Apr 2026 in cs.GR | (2604.21749v1)

Abstract: Previous work shows that small triangles can be rasterized efficiently with compute shaders. Building on this insight, we explore how far this can be pushed for massive triangle datasets without the need to construct acceleration structures in advance. Method: A 3-stage rasterization pipeline first rasterizes small triangles directly in stage 1, using atomicMin to store the closest fragments. Larger triangles are forwarded to stages 2 and 3. Results: CuRast can render models with hundreds of millions of triangles up to 2-5x (unique) or up to 12x (instanced) faster than Vulkan. Vulkan remains an order of magnitude faster for low-poly meshes. Limitations: We currently focus on dense, opaque meshes that you would typically obtain from photogrammetry/3D reconstruction. Blending/Transparency is not yet supported, and scenes with thousands of low-poly meshes are not implemented efficiently. Future Work: To make it suitable for games and a wider range of use cases, future work will need to (1) optimize handling of scenes with tens of thousands of nodes/meshes, (2) add support for hierarchical clustered LODs such as those produced by Meshoptimizer, (3) add support for transparency, likely in its own stage so as to keep opaque rasterization untouched and fast. Source Code: https://github.com/m-schuetz/CuRast

Abstract PDF Upgrade to Chat

Authors (4)

Summary

The paper introduces a CUDA-accelerated three-stage rasterization pipeline that efficiently renders billions of triangles in dense mesh scenes.
It achieves performance gains by adapting thread granularity to triangle size and using 64-bit atomicMin for effective depth buffering.
Results show up to 12× speed improvements in instanced geometry compared to Vulkan, proving its strength for photogrammetry and large-scale editing.

CuRast: CUDA-Based Software Rasterization for Billions of Triangles

Introduction

CuRast presents a CUDA-accelerated software rasterizer tailored for dense, mesh-heavy scenes containing hundreds of millions to billions of triangles. The motivation stems from the demonstrated inefficiency of hardware rasterization pipelines when processing highly detailed geometry with predominantly small, pixel-sized triangles—a scenario encountered frequently in photogrammetry, 3D reconstruction, and complex content creation workflows. By deliberately eschewing acceleration structures or hierarchical levels of detail (LOD), the pipeline can rapidly ingest, modify, and render vast datasets directly, avoiding the overhead and rigidity associated with preprocessing.

Architecture and Methodology

CuRast achieves performance gains through a three-stage rasterization pipeline that adapts its processing granularity to the size of each triangle. The stages are differentiated as follows:

Stage 1: Handles small triangles (<128 pixels in screen-space bounding box) using one thread per triangle.
Stage 2: Processes medium triangles (<4096 pixels) by assigning a warp (32 threads) per triangle.
Stage 3: For large triangles, fragments are further partitioned into 64×64 pixel tiles, each processed by a workgroup (64 threads).

This multi-stage approach is critical for balancing compute load and minimizing divergent execution (Figure 1).

Figure 1: Rasterization pipeline stages, colored by iteration counter, depict increasing thread parallelism for larger triangles.

Pipeline design relies on CUDA atomicMin for depth-buffering, extended to 64 bits for joint encoding of depth and triangle ID, eliminating explicit fragment sorting or inter-thread synchronization. Visibility buffering is employed in lieu of traditional deferred shading: instead of writing multiple per-pixel attributes, only the identifiers needed to resolve visible geometry for subsequent shading are stored (Figure 2).

Figure 2: Deferred shading produces G-Buffers, while visibility buffers—used by CuRast—contain only triangle indices per pixel.

The pipeline also integrates instancing (across Stage 1 only, to minimize register overhead), lightweight geometry compression (index bit-packing and 16-bit fixed-precision positions), and can operate directly on compressed or streamed mesh data.

Shading, Texture Fetch, and Mipmapping

The visibility buffer's global triangle index (36 bits) is resolved to mesh, instance, and triangle through a binary search on a prefix-sum array. Shading is deferred, with per-pixel passes retrieving vertex and texture data as required. Mip level determination for texture fetches is achieved by intersecting rays with the triangle plane at the pixel’s center and neighboring locations, then extrapolating UVs to compute the texture footprint (Figure 3).

Figure 3: Mip level estimation using projected pixel positions and extrapolated UVs after world-space ray-triangle intersection.

Performance Evaluation

CuRast is benchmarked on three NVIDIA GPUs (RTX 4070, 4090, 5090) and compared to Vulkan pipelines employing both standard indexed draws and programmable index pulling, with and without mesh optimization. The evaluation spans canonical scenes (Sponza), challenging photogrammetry datasets (Komainu Kobe, Venice), heavily instanced geometry (Lantern Instances), and the Zorah dataset with nearly 19 billion triangles post-frustum culling.

Key findings include:

For dense, high-poly scenes and large numbers of instances, CuRast outperforms Vulkan by 2-5× (unique geometry) and up to 12× (instanced geometry).
Vulkan is $1-2$ orders of magnitude faster in scenarios with many small, low-poly meshes—a known limitation given CuRast’s lack of acceleration structures and meshlet-like grouping.
Compression and mesh optimization have the greatest absolute effect on Vulkan, with meshoptimized draw calls sometimes doubling in efficiency. For CuRast, improvements are present but more modest.
The vast majority of triangles in massive datasets such as Zorah are efficiently processed in Stage 1 (Figure 4).
Figure 4: Top—bounding-box size per pixel's triangle; Bottom—contribution of each rasterization stage, indicating dominance of Stage 1 for dense geometry in Zorah.
Early culling of triangles that do not intersect any pixels provides substantial speedup (up to 40% faster in Zorah) without negatively impacting workloads dominated by large triangles.
Super sampling introduces a proportional cost increase in small-mesh scenes but only a moderate increase for massive dense scenes, as memory bandwidth is a limiting factor.

Strengths, Limitations, and Implications

CuRast establishes that for workloads characterized by a small number of large, dense, and predominantly opaque meshes—typical of photogrammetry, scientific imaging, and content creation—the compute-centric, atomic-min-based software rasterization paradigm is highly competitive.

The pipeline’s most notable strengths are:

Real-time rendering of up to 18.9 billion triangles, with rapid GPU-side data preparation and elimination of preprocessing bottlenecks.
Strong scaling with mesh instancing and minimal per-instance bandwidth usage.
Effective support for on-the-fly mesh modification and editing, valuable in interactive scientific and creative applications.

Yet limitations remain significant:

Poor performance for workloads with many small, low-density meshes, as commonly found in game engines.
Lack of transparency/blending support; current pipeline is for opaque geometry only.
Acceleration and LOD structures are not used, limiting application to games and scenarios requiring hierarchical simplification or efficient culling at scene scale.

Future Research Directions

Several avenues are highlighted for further development:

Integration of hierarchical/clustered LOD structures, as employed in Nanite and related meshlet systems, to adapt to game-like, multi-mesh scenes.
Support for blending, transparency, and advanced material systems in a manner compatible with the staged pipeline.
Adaptive scheduling strategies for rasterization, limiting per-thread processing to in-triangle fragments rather than bounding-box scans to optimize for sparse or clustered geometry.
GPU-side vertex reuse and meshlet-style vertex processing for improved memory locality and bandwidth efficiency.

Implementing these directions could lead to pipelines suited for general-purpose, real-time rendering engines, including gaming applications—bridging the gap between photogrammetric content editing and traditional graphics workloads.

Conclusion

CuRast demonstrates that compute-driven, atomic-min-based software rasterization on modern GPUs is a viable and highly performant alternative to hardware pipelines when rendering dense, high-triangle-count scenes. For applications involving massive, editable, or animated geometry without strict real-time constraints imposed by complex scene hierarchies, CuRast achieves substantial performance improvements, enabled by its three-stage pipeline, efficient visibility buffering, and flexible memory management. Realizing practical relevance for a wider spectrum of real-time graphics use cases will require further advances in meshlet-based organization, robust LOD techniques, and enhanced feature support.

Markdown Report Issue

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

off on

Knowledge Gaps

off on

Practical Applications

off on

Glossary

off on

Conceptual Simplification

off on

Explain it Like I'm 14

What is this paper about?

This paper introduces CuRast, a super-fast way to draw 3D models made of triangles using the GPU in a different way than usual. Instead of relying on the GPU’s built‑in “drawing pipeline,” it uses general‑purpose GPU computing (CUDA) to “paint” triangles onto the screen with custom code. The goal is to render massive models—hundreds of millions to billions of triangles—in real time, without needing long, time‑consuming setup steps.

What questions did the researchers ask?

They mainly asked:

Can we render huge triangle models really fast using software on the GPU (CUDA), without building special helper structures (like LODs or spatial trees) ahead of time?
Can we make small, pixel‑sized triangles extremely fast while still handling medium and big triangles well enough?
Can we beat (or at least rival) the speed of modern graphics pipelines (like Vulkan) for the kinds of scenes that have tons of tiny triangles?

How does it work? (Simple explanation of the method)

Think of the screen as a giant grid of pixels. Each triangle is like a sticker you place on that grid. If two stickers overlap, the one closer to the camera should show.

CuRast “paints” triangles using three stages, choosing the best approach depending on how big each triangle looks on the screen:

Stage 1 (small triangles): One tiny worker handles one triangle at a time. This is super fast when triangles are tiny—like sprinkling pixel‑sized confetti.
Stage 2 (medium triangles): A small team of workers shares a triangle, splitting the work so it finishes quickly.
Stage 3 (large triangles): A larger team paints a triangle in chunks (tiles), making sure even giant triangles get done efficiently.

Key ideas explained simply:

“Closest pixel wins” rule (atomics): When two triangles try to paint the same pixel, the GPU uses a built‑in rule so only the closest one sticks. This avoids arguing and makes the process fast and safe.
“Visibility buffer”: Instead of storing full color and material details during painting, CuRast first stores “which triangle is in front for each pixel.” Later, in a separate pass, it looks up that triangle’s information (like texture) and shades the final color. This saves time and memory while drawing.
No prebuilt acceleration structures: Many systems need hours of pre‑processing to build special data for skipping hidden details. CuRast skips that step, so it’s great for models that change often (like when editing or animating).
Instancing: If the same object appears many times (like a lantern placed thousands of times), CuRast loads its triangle data once and reuses it, saving bandwidth.
Compression: To fit giant scenes in GPU memory, positions are stored in 16‑bit fixed precision (good enough for millimeter‑level detail in many cases), and triangle indices are packed tightly so they take up fewer bits.

What did they find?

For huge, dense models (lots of tiny triangles), CuRast is much faster than the standard Vulkan pipeline:
- About 2–5× faster for unique (non‑repeated) geometry.
- Up to about 12× faster when the same mesh is reused many times (instanced).
For simple, low‑poly scenes (like the Sponza demo with big triangles and many separate parts), Vulkan is still much faster—about 7–13× faster. CuRast focuses on heavy, detailed geometry, not lightweight scenes with many small objects.
Real‑world stress test: They rendered a massive scene (“Zorah”) with 18.9 billion triangles total (about 13.5 billion visible) at 4K resolution in roughly 67 milliseconds per frame on a top‑end GPU. The data (38.8 GB) was compressed to 21.7 GB and loaded to the GPU in around 6.6 seconds.
Mesh optimization tools (which reorder triangles/vertices to improve memory access) help both Vulkan and CuRast, but they help Vulkan more. Even so, for big dense scenes CuRast remains faster.
On the very latest GPUs, some traditional methods got faster, but CuRast still shines for the target cases.

Why this matters:

It proves that a compute‑based, software‑style renderer can be ridiculously fast for certain types of scenes—especially scans or reconstructions with tons of tiny triangles—without the usual heavy pre‑processing.

Why is this important? (Implications and impact)

Great for 3D scans, photogrammetry, and content creation tools: If you’re reconstructing real‑world objects from photos or editing huge meshes in tools like Blender, you often change the model a lot. CuRast avoids slow pre‑processing, so you can edit and view massive models more interactively.
Handles billions of triangles in real time: This opens the door to smoother workflows for massive datasets that used to be too slow or needed lots of prep work.
Not ready for all games (yet): Typical games have thousands of small, separate objects and often need transparency and special effects. CuRast currently focuses on dense, opaque meshes and doesn’t support transparency.
Future improvements planned:
- Better handling of scenes with tens of thousands of separate meshes.
- Support for efficient “levels of detail” (so distant objects cost less to draw).
- Support for transparency without slowing down opaque rendering.

In short: CuRast shows that for the right kind of scenes—huge, dense triangle data—it’s possible to draw incredibly complex models at high speed using a custom CUDA pipeline, sometimes far faster than traditional graphics pipelines. This can make building and exploring massive 3D worlds much more practical.

View Paper Prompt View All Prompts

Knowledge Gaps

Below is a concise, actionable list of knowledge gaps, limitations, and open questions left unresolved by the paper. Each point highlights a concrete direction for future research or engineering work.

Transparency and blending are unsupported; identify and implement an order-independent transparency strategy (e.g., weighted blended OIT, per-pixel linked lists, A-buffer) compatible with 64-bit atomicMin and quantify performance/memory trade-offs.
Poor scaling for scenes with tens of thousands of low-density meshes; design a pipeline that efficiently handles many small drawables (batching, per-mesh/instance queues, GPU-side culling, and scheduling).
Lack of hierarchical LOD integration; develop clustered/meshoptimizer-style LOD selection and switching policies suitable for a compute rasterizer, including dynamic updates without costly preprocessing.
Large-triangle handling in Stage 3 relies on per-pixel world-space ray-triangle tests; explore more efficient coverage methods (clipping, edge functions with robust near-plane handling, tile-edge coverage tests) and assess accuracy/performance.
Near-plane clipping is only approximated (accurate bbox only in Stage 2, no real clipping); implement robust triangle clipping across stages and evaluate its cost vs. Stage 3 fallback.
Stage thresholds (≤128 px small, ≤4096 px medium) are heuristic; devise adaptive, hardware- and workload-aware thresholding and tile sizing, with a self-tuning strategy based on runtime telemetry.
Visibility buffer bit allocation (28-bit depth, 36-bit triangle ID) is fixed; analyze precision/robustness across extreme depth ranges, sensitivity to z-fighting, and alternative encodings (log depth, configurable bit budgets, 96/128-bit atomics when available).
Resolve pass locality is potentially poor (per-pixel binary search into a global prefix sum and random triangle fetches); investigate cache-friendly screen- or tile-space reordering, per-tile compaction, two-level indexing (mesh ID + local tri ID), and other data layouts to improve memory coherence.
Compute cost in resolve is high for mip selection (four ray-plane intersections per pixel); explore cheaper approximations (screen-space derivatives, precomputed per-triangle gradients, hybrid rules near the near plane) with error/perf analysis.
Anti-aliasing is not addressed; evaluate MSAA-like coverage masks, stochastic AA, or temporal AA integration with visibility buffers and atomicMin constraints.
No anisotropic filtering or ddx/ddy-based texture filtering support; determine how to compute reliable derivatives in a compute/visibility-buffer pipeline and compare with hardware samplers.
No occlusion culling beyond atomicMin; examine hierarchical Z (HZB) or coarse depth tiles to skip rasterization of fully occluded triangles, especially in high overdraw scenes.
Load balancing and contention under extreme overdraw are not characterized; measure atomicMin contention patterns and test triangle ordering/screen-tiling strategies to reduce hot-pixel conflicts.
Instancing is only optimized in Stage 1 and increases register pressure; design low-overhead instancing paths across stages (e.g., per-triangle shared data, instance bucketing, or kernel variants) and GPU-side per-instance frustum culling.
Compression choices (16-bit fixed-point positions, variable-bit indices) lack a quality/error study; quantify geometric error, visual artifacts (especially near the camera), and the impact on shading/LODs; explore adaptive precision and on-the-fly encode/decode costs.
Out-of-core streaming is briefly demonstrated (6.6 s load+compress), but no runtime streaming/eviction policy is presented; design asynchronous streaming, prioritization, and residency management for geometry and textures under tight VRAM budgets.
Portability is limited to CUDA/NVIDIA; assess feasibility and performance on other backends (HIP, Vulkan/DirectX compute, Metal) and the availability/speed of 64-bit atomics across vendors.
Evaluation scope omits comparisons with other visibility-buffer or software rasterizers (e.g., CUDARaster/Nanite-like paths) and lacks a Vulkan visibility-buffer baseline; conduct fair, like-for-like comparisons including resolve costs.
Energy efficiency and thermal behavior under heavy atomics/compute are unmeasured; profile power/perf (perf/W) and explore energy-aware scheduling.
Scalability with framebuffer resolution (e.g., 8K) and multi-sampling is not evaluated; characterize performance/memory scaling and adapt visibility buffer formats accordingly.
CPU/GPU division of labor (CPU-side frustum culling, host-side mesh prefix sums) may bottleneck scenes with many nodes; prototype GPU-based culling and prefix-sum computation, and quantify latency/overhead trade-offs.
Determinism and tie-breaking with equal-depth fragments are unspecified; define a deterministic policy (e.g., triangle ID priority) and assess artifact risks with quantized depth.
Animated/skinned meshes are not tested; extend pipeline to dynamic per-vertex transforms, evaluate performance for animated content, and study implications for any future LOD/instancing schemes.
Texture subsystem details (JPEG decode in CUDA) lack a general framework; generalize to diverse compressed formats (BCn/ASTC/ETC), explore hardware-accelerated decode paths, and quantify decode overhead vs. sampling quality.
Memory footprint of visibility buffers at high resolutions and with MSAA is not discussed; analyze VRAM usage and evaluate compression or alternative encodings to keep memory in check.
Mesh/triangle ordering for CuRast is only modestly explored; investigate screen-space or tile-space orderings tailored for atomicMin to maximize cache coherence and minimize pixel contention.
Stage 3 tile size (64×64) is fixed; study adaptive tile sizes and workgroup shapes relative to triangle size/aspect to reduce underutilization and improve coherence.
Robustness to degenerate/near-degenerate triangles and extreme aspect ratios is not examined; implement guards and test numerically stressful cases (very thin triangles, huge depth ranges).
Integration with advanced lighting (shadows, PBR, deferred/clustered lighting) is not detailed; design a shading pipeline that marries visibility buffers with modern material graphs and measure end-to-end performance.

View Paper Prompt View All Prompts

Practical Applications

Immediate Applications

These applications can be deployed now using the paper’s open-source code (CuRast) and the described methods, with light engineering for integration.

Massive photogrammetry/scan viewer for dense opaque meshes
- Sectors: VFX, cultural heritage, surveying, digital twins (urban), AEC field capture
- What: A workstation or kiosk viewer that loads tens of GB of dense triangle meshes from NVMe, compresses/quantizes on the fly, and renders billions of triangles in real time without precomputed LODs or acceleration structures.
- Tools/workflows: Standalone CuRast-based “GigaMesh” viewer or a DCC plug-in (e.g., Blender viewport mode) that uses the 3-stage rasterization, visibility buffer, and compression (variable-bit indices, 16-bit positions).
- Assumptions/dependencies: NVIDIA CUDA-capable GPU with fast 64-bit atomics (RTX-class recommended), fast NVMe SSD, opaque geometry (no transparency), scenes skewed toward dense meshes with pixel-size triangles; VRAM capacity dictates compression and streaming strategy.
On-set/field QC for 3D scanning and virtual production
- Sectors: Virtual production, AEC site capture, UAV photogrammetry
- What: Immediate visual validation of high-res scans (e.g., 20–40 GB) on a laptop/workstation in seconds (e.g., 6–7 s in the paper), enabling rapid decision-making without long LOD preprocessing.
- Tools/workflows: Integrate CuRast into scanning pipelines (RealityCapture/RealityScan → CuRast QC viewer) for instant preview before further processing.
- Assumptions/dependencies: Same as above; can tolerate lack of transparency; field hardware must have a capable NVIDIA GPU.
Server-side high-throughput renders for asset catalogs and archives
- Sectors: 3D marketplaces, museums/archives, digital twin platforms
- What: Headless, containerized microservice that renders turntables/thumbnails or on-demand views of massive scanned assets; avoids LOD building and reduces operational latency.
- Tools/workflows: CuRast as a service (gRPC/REST) rendering to images/streams; frame streaming via WebRTC/VNC; autoscaling per GPU.
- Assumptions/dependencies: NVIDIA GPUs in the datacenter; opaque models preferred; request routing and caching around camera paths; bandwidth to transfer frames.
Instanced-scene preview at extreme scale
- Sectors: DCC set dressing, vegetation scattering, product catalogs, manufacturing
- What: Interactive preview of scenes with millions of triangles per mesh and thousands of instances (billions of triangles total), benefiting from the stage-1 instancing optimization.
- Tools/workflows: CuRast instancing path for one-to-many transforms; scene graph that submits one geometry with many instances.
- Assumptions/dependencies: Best when each base mesh is dense; identical geometry across instances; opaque materials.
Synthetic data generation (depth and IDs) from large scans
- Sectors: Robotics, autonomous driving, computer vision
- What: Rapid generation of ground-truth depth and per-triangle/per-instance IDs via the visibility buffer for segmentation, occlusion, and depth maps from dense environments.
- Tools/workflows: Render scripted camera paths to depth/ID buffers; map triangle IDs to instance/semantic labels; export masks and metadata.
- Assumptions/dependencies: Opaque surfaces; semantic ID mapping pipeline; NVIDIA GPU.
Texture-on-demand via in-kernel JPEG decoding for large textures
- Sectors: Graphics software, digital twins, visualization platforms
- What: Render scenes with very high-resolution textures that don’t fit in VRAM by decoding only visible JPEG blocks during the resolve pass.
- Tools/workflows: Integrate CuRast’s JPEG-block indexing/decoding in shading; pair with visibility buffer to fetch only needed texels.
- Assumptions/dependencies: Textures stored as JPEG; CUDA-capable JPEG decode on GPU; minor latency from per-block decode acceptable.
Asset packaging with lightweight geometry compression
- Sectors: Content pipelines, 3D data distribution
- What: Reduce memory and I/O by quantizing positions to 16-bit fixed precision per-mesh and compressing indices to the minimum bit-width required by index span.
- Tools/workflows: Pre-export step that computes per-mesh bounds and index spans; stores scales/offsets for decode in kernels; optional dual-path kernels (compressed/uncompressed).
- Assumptions/dependencies: Acceptable precision loss (e.g., 1 mm for 65 m span); encode/decode logic integrated into tools; compressed indices not directly compatible with standard 16/32-bit indexed draws.
Teaching and research testbed for software rasterization at scale
- Sectors: Academia, training, graphics R&D
- What: Use CuRast as an open, reproducible platform to study atomics-based rasterization, visibility buffers, persistent kernels, and mip mapping in world space on massive datasets.
- Tools/workflows: Course labs comparing compute vs hardware pipelines; research into load balancing, culling, compression.
- Assumptions/dependencies: NVIDIA GPUs; access to large datasets; tolerance for opaque-only constraints.

Long-Term Applications

These applications require additional R&D (e.g., scaling to many low-density meshes, transparency, cross-vendor support) before broad deployment.

Game engine integration for micropolygon-heavy content
- Sectors: Gaming, real-time engines
- What: A Nanite-like path that uses CuRast’s compute rasterizer for pixel-sized triangles but adds support for hierarchical clustered LODs, scene graphs with tens of thousands of nodes, and a dedicated transparency stage.
- Tools/workflows: Hybrid renderer switching between compute rasterization and hardware depending on triangle size; vis-buffer resolve with material graph integration.
- Assumptions/dependencies: New stage for transparency; cluster/LOD build pipeline (e.g., Meshoptimizer clusters); cross-vendor compute backend (Vulkan/DX12/HIP).
CAD/BIM/GIS visualization with many low-density meshes
- Sectors: AEC, GIS, infrastructure digital twins
- What: Extend pipeline scalability to tens of thousands of coarse meshes (buildings, parts) via improved batching, node-level culling, and coarse-grained scheduling.
- Tools/workflows: Scene-graph aware persistent kernels; batched draws by material/mesh families; hierarchical culling.
- Assumptions/dependencies: Algorithmic changes to address current scaling limitations for low-density mesh counts; feature parity with CAD materials; transparency for glass.
AR/VR viewing of city-scale scans
- Sectors: AR/VR, urban planning, digital heritage
- What: Low-latency, foveated compute rasterization for photogrammetric cities in headsets; streaming high-detail periphery on demand.
- Tools/workflows: Eye-tracking–driven variable-rate shading in compute; asynchronous time-warp; GPU-side LOD streamers.
- Assumptions/dependencies: Significant optimization for latency; mobile/standalone headsets require non-CUDA paths; transparency and antialiasing support.
Hybrid renderers combining compute, hardware rasterization, and splats
- Sectors: Game/film engines, visualization
- What: Dynamically choose between CuRast for dense triangles, hardware for large/low-density meshes, and 3D Gaussian splats/points for certain content.
- Tools/workflows: Runtime scheduler based on primitive size and material; unified visibility and shading interfaces.
- Assumptions/dependencies: Complex orchestration and compositing; consistent material/shading across paths; transparency handling.
Multi-GPU and distributed trillion-triangle visualization
- Sectors: Cloud visualization, HPC, national archives
- What: Partition screens/geometry across GPUs or nodes (sort-first/last) with parallel visibility buffers and fast compositing; interactive city or country-scale models.
- Tools/workflows: NVLink-aware tiling; RDMA-based compositing; cluster scheduling; out-of-core geometry streaming.
- Assumptions/dependencies: High-bandwidth interconnects; robust load balancing; fault tolerance.
Sensor-accurate simulators for robotics and AV
- Sectors: Robotics, autonomous systems, simulation
- What: Extend resolve stage to simulate LiDAR returns, RGB-D sensors, and material-dependent effects over dense meshes; generate multi-modal ground truth.
- Tools/workflows: Add per-hit material queries, BRDFs, and temporal effects; integrate translucency and reflective materials.
- Assumptions/dependencies: Transparency/blending stage; physically based shading; deterministic timing for closed-loop sim.
Public-sector remote visualization portals for massive 3D scans
- Sectors: Government, policy, cultural heritage access
- What: Citizen-facing portals that stream interactive views of large 3D reconstructions (cities, artifacts) rendered server-side on GPUs.
- Tools/workflows: Cloud-hosted CuRast clusters; access control, caching, and progressive camera-path streaming.
- Assumptions/dependencies: Privacy and data governance; operational costs; network bandwidth; multi-tenant fairness.
Cross-vendor portability and mobile-class deployment
- Sectors: Software, mobile XR
- What: Port the pipeline to Vulkan/DX12 compute or HIP to support AMD and mobile GPUs; explore subgroup operations for atomics and depth updates.
- Tools/workflows: Backend-abstraction layer; performance tuning per architecture; feature probes for 64-bit atomics and subgroup capabilities.
- Assumptions/dependencies: Platform differences in atomicMin performance; potential redesign for memory and wavefront sizes.
High-throughput offline baking and map generation
- Sectors: Games, VFX, DCC
- What: Use the compute rasterizer to accelerate baking (normals, AO, ID maps) from ultra-high-poly sources onto target assets without LOD preprocessing.
- Tools/workflows: Batch baking service that leverages visibility buffer and world-space shading; integration with DCC bake pipelines.
- Assumptions/dependencies: Additional passes for AO/occlusion rays; UV/mip control; non-opaque effects may be needed.
Incremental editing and streaming of massive meshes
- Sectors: DCC tools, CAD
- What: Editing sessions on huge meshes where geometry changes invalidate LODs—CuRast’s no-precompute pipeline enables immediate viewport updates.
- Tools/workflows: DCC integration that streams deltas; per-mesh recompaction of compressed buffers; fast re-culling.
- Assumptions/dependencies: Efficient editor APIs; improved handling of many scene nodes; optional transparent overlays for selections.

Notes on feasibility across all applications:

Best performance arises with dense, opaque, triangle-heavy content and significant instancing; scenes dominated by many low-density meshes or transparency currently underperform.
NVIDIA CUDA dependency is a near-term constraint; cross-vendor adoption needs a compute-port (e.g., Vulkan/DX12/HIP) with comparable 64-bit atomic performance.
IO and memory bandwidth (NVMe → VRAM) are critical; benefits grow with faster SSDs, larger VRAM, and high memory bandwidth GPUs.

View Paper Prompt View All Prompts

Glossary

3D Gaussian Splatting (3DGS): A point-based rendering method that represents scenes with 3D Gaussian primitives and blends them in screen space for high-quality reconstructions. "3DGS proposes using translucent gaussian primitives for scene reconstruction from photos, resulting in high-quality models of the real world."
acceleration structures: Spatial data structures (e.g., BVH, grids) built to speed up rendering or intersection queries. "without the need to construct acceleration structures in advance."
atomic-CAS: An atomic compare-and-swap operation used to implement locking or synchronization on the GPU. "first attempts to lock it with atomic-CAS, then updates and subsequently unlocks it."
atomicMin: An atomic minimum operation that safely updates a shared value (e.g., depth) with the smallest of competing writes. "using atomicMin to store the closest fragments."
backface culling: Skipping triangles whose orientation faces away from the camera to reduce rasterization work. "followed by performing backface culling."
barycentric coordinates: A coordinate system for points inside a triangle, used for interpolation and inside-triangle tests. "computes the barycentric coordinates of the triangle"
coalesced memory access: GPU-friendly memory access pattern where adjacent threads access adjacent addresses to maximize bandwidth. "promote coalesced memory access"
compute shaders: GPU programs executed in the compute pipeline, often used for general-purpose or custom rasterization tasks. "Previous work shows that small triangles can be rasterized efficiently with compute shaders."
cooperative groups: CUDA programming construct enabling synchronization and coordination among thread groups. "persistent-kernel launch using cooperative groups"
deferred rendering: A pipeline that first writes per-pixel attributes to G-Buffers, then performs lighting in a later pass. "such as deferred rendering with rich G-Buffers."
deferred shading: Performing the shading/lighting step after geometry rasterization using stored per-pixel attributes or IDs. "Visibility buffer rendering and deferred shading with G-Buffers are closely related."
depth buffer: A per-pixel buffer storing depth values to resolve visibility during rasterization. "the depth buffer"
exclusive prefix sum: A scan operation producing cumulative totals where each output excludes the current element; used for indexing. "compute a cumulative triangle count for each mesh and instance (i.e. the exclusive prefix sum)"
eye-dome lighting: A screen-space shading technique that enhances edge contrast in dense geometry or point clouds. "with screen space ambient occlusion and eye-dome lighting enabled."
frustum culling: Discarding objects or triangles outside the camera’s view volume to avoid unnecessary work. "13.5 billion triangles visible after frustum culling."
G-Buffers: Geometry buffers storing per-pixel attributes (normals, albedo, etc.) for deferred shading. "Deferred shading creates G-Buffers with various attributes that may be needed for the shading pass."
GPGPU: General-purpose computing on graphics processing units, using GPUs for non-graphics workloads. "GPGPU-accelerated software rasterization"
hierarchical levels of detail (LOD): Multi-resolution representations used to render simpler versions of distant or small objects. "without the need to create spatial acceleration structures or hierarchical levels of detail in advance."
instancing: Rendering many copies of the same mesh efficiently by reusing geometry with different transforms. "Instancing allows us to reduce this bandwidth bottleneck"
item buffers: Early term for visibility buffers that store per-pixel IDs of first-hit scene elements (in ray tracing). "Visibility buffers were initially introduced under the term item buffers in the context of ray tracing"
JPEG-compressed textures: Textures stored in JPEG format and decoded at runtime to reduce memory footprint and bandwidth. "j: JPEG-compressed textures."
meshoptimizer: A toolkit for optimizing mesh data for locality, cache efficiency, and vertex reuse. "Meshoptimizer is an open source project that implements strategies that promote locality and vertex reuse"
micropolygons: Very small triangles, often around pixel-sized, that stress rasterization throughput. "pure micropolygon geometry."
mipmap level: The selected resolution level in a mipmapped texture that best matches the pixel’s texture footprint. "determine the mipmap level"
near plane: The closest clipping plane of the camera frustum; geometry intersecting it needs clipping. "intersects the near plane"
normalized device coordinates (NDC): The post-projection coordinate space where x,y,z are normalized before viewport transform. "in normalized device coordinates"
persistent kernel: A GPU kernel that runs persistently, fetching tasks dynamically to improve load balancing. "persistent-kernel launch"
perspective-correct interpolation: Interpolation of attributes accounting for perspective division to avoid distortion. "For efficient perspective-correct interpolation of depth values"
photogrammetry: Reconstructing 3D geometry from photographs, often producing dense, detailed meshes. "photogrammetry reconstruction"
programmable index pulling: Fetching indices in shader code (rather than fixed-function index buffers) for flexible formats or compression. "uses programmable index pulling and vertex pulling."
ray cast: Tracing a ray from the camera through a pixel to find intersections with geometry. "performs a ray cast in world space"
ray-triangle intersections: Computing whether and where a ray hits a triangle, typically used for visibility or shading. "ray-triangle intersections in world space"
resolve pass: A full-screen pass that uses stored per-pixel IDs to fetch attributes and perform shading. "in the full-screen resolve pass"
screen space ambient occlusion (SSAO): A technique estimating ambient occlusion using depth information in screen space. "with screen space ambient occlusion and eye-dome lighting enabled."
sort-middle: A graphics pipeline organization that sorts primitives between geometry and rasterization stages. "via a sort-middle approach"
spin loops: Busy-wait loops used in synchronization patterns, e.g., for atomic locking on pixels. "based on spin loops"
Sutherland-Hodgman algorithm: A polygon clipping algorithm used to clip triangles against the view frustum. "via the Sutherland-Hodgman algorithm."
vertex pulling: Fetching vertex attributes directly in shaders, bypassing fixed-function vertex fetching. "uses programmable index pulling and vertex pulling."
vertex reuse: Reusing transformed vertices to reduce redundant processing and memory access. "vertex reuse mechanisms"
visibility buffer: A buffer storing per-pixel IDs (mesh/triangle) of the visible surface, deferring attribute fetch and shading. "Visibility buffers only store triangle indices."
warp: A group of 32 threads executing in lockstep on NVIDIA GPUs. "Launch a warp (32 threads) per triangle"
workgroup: A set of threads launched together that can cooperate via shared memory and synchronization. "one workgroup comprising 64 threads"

View Paper Prompt View All Prompts

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Generate Now

Continue Learning

Collections

GitHub

GitHub - m-schuetz/CuRast: Cuda-Based Software Rasterization for Billions of Triangles · GitHub (64 stars)

Tweets

HackerNews

CuRast: CUDA-Based Software Rasterization for Billions of Triangles (3 points, 0 comments)

CuRast: Cuda-Based Software Rasterization for Billions of Triangles

Summary

CuRast: CUDA-Based Software Rasterization for Billions of Triangles

Introduction

Architecture and Methodology

Shading, Texture Fetch, and Mipmapping

Performance Evaluation

Strengths, Limitations, and Implications

Future Research Directions

Conclusion

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

What is this paper about?

What questions did the researchers ask?

How does it work? (Simple explanation of the method)

What did they find?

Why is this important? (Implications and impact)

Knowledge Gaps

Practical Applications

Immediate Applications

Long-Term Applications

Glossary

Open Problems

Continue Learning

Collections

GitHub

Tweets

HackerNews

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research