LucidRaster: GPU Rasterizer for Exact OIT

Updated 2 December 2025

LucidRaster is a software-based GPU rasterizer designed for exact order-independent transparency, eliminating artifacts common in hardware blending.
It employs a novel two-stage sorting technique within Vulkan compute shaders, leveraging optimized parallel data structures to handle high triangle density and depth complexity.
The system achieves high fidelity and speed, outperforming methods like MBOIT and MLAB while optimizing GPU memory usage and computational overhead.

LucidRaster is a fully software-based GPU rasterizer developed for exact, artifact-free order-independent transparency (OIT) in real-time graphics. Designed to address the intrinsic inefficiencies and correctness trade-offs found in prior transparency solutions, LucidRaster operates within the Vulkan compute shader framework and introduces a novel two-stage sorting technique, coupled with optimized parallel data structures, to efficiently handle high triangle density and depth complexity. The system achieves greater fidelity and speed relative to traditional approximations, delivering exact OIT at approximately three times the hardware alpha-blending cost and outperforming established methods such as MBOIT and MLAB in high-fidelity transparent rendering scenarios (Jakubowski, 22 May 2024).

1. Pipeline Architecture

LucidRaster's core pipeline employs a three-stage sort-middle design optimized for the hierarchical, parallel nature of modern GPUs:

Quad Setup: Scene geometry decomposed into quads (pairs of triangles) undergoes backface culling, frustum clipping, and classification as "small" or "large". Each quad's geometric and shading attributes (edge functions, normals, UVs) are stored in cache-friendly buffers. Work is distributed into 1024-element batches per workgroup, leveraging a "filter + compact" phase to cheaply discard invisible primitives.
Bin Construction: The render target is subdivided into 32×32 pixel bins. Primitive overlaps per bin are determined in a two-phase process (approximate for small, scanline-exact for large), followed by prefix-sum bin offset computation and density classification. Per-bin primitive lists are constructed in global memory, supporting high-throughput sort-middle processing.
Bin Rasterization: Each bin is rasterized by persistent-thread workgroups. Low- and high-density bins are processed by different thread counts (256 and 1024, respectively). Bins undergo recursive subdivision (32×8 block-rows, 8×8 blocks, 8×4 half-blocks), with per-block and per-pixel sorting to guarantee exact fragment composition order for transparency.

This approach contrasts with hardware blending (which composites fragments in submission order without any sort, producing order-dependent artifacts) and with single-pass OIT approximations (such as WBOIT and MBOIT). It delivers guaranteed correct compositing without splits or peeling passes, at an optimized storage and sorting cost (Jakubowski, 22 May 2024).
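
The bin-construction stage can be sketched on the CPU as a count, prefix-sum, scatter pattern. This is a minimal Python illustration under stated assumptions, not the paper's compute shader: it uses a conservative bounding-box overlap test (roughly the "approximate" path described for small primitives), and the names `bins_for_aabb` and `build_bin_lists` are hypothetical.

```python
from itertools import accumulate

BIN_SIZE = 32  # bin edge length in pixels, as stated in the text


def bins_for_aabb(x0, y0, x1, y1, width, height):
    """Return (bx, by) indices of all bins touched by an inclusive pixel AABB."""
    bins_x = (width + BIN_SIZE - 1) // BIN_SIZE
    bins_y = (height + BIN_SIZE - 1) // BIN_SIZE
    bx0, by0 = max(x0 // BIN_SIZE, 0), max(y0 // BIN_SIZE, 0)
    bx1 = min(x1 // BIN_SIZE, bins_x - 1)
    by1 = min(y1 // BIN_SIZE, bins_y - 1)
    return [(bx, by) for by in range(by0, by1 + 1) for bx in range(bx0, bx1 + 1)]


def build_bin_lists(aabbs, width, height):
    """Build per-bin primitive lists as (counts, offsets, flat index buffer)."""
    bins_x = (width + BIN_SIZE - 1) // BIN_SIZE
    bins_y = (height + BIN_SIZE - 1) // BIN_SIZE
    counts = [0] * (bins_x * bins_y)

    # Pass 1: count how many primitives overlap each bin.
    for box in aabbs:
        for bx, by in bins_for_aabb(*box, width, height):
            counts[by * bins_x + bx] += 1

    # Exclusive prefix sum turns counts into start offsets in the flat buffer.
    offsets = [0] + list(accumulate(counts))[:-1]

    # Pass 2: scatter primitive indices into each bin's slice.
    cursor = offsets[:]
    flat = [0] * sum(counts)
    for prim, box in enumerate(aabbs):
        for bx, by in bins_for_aabb(*box, width, height):
            b = by * bins_x + bx
            flat[cursor[b]] = prim
            cursor[b] += 1
    return counts, offsets, flat
```

On a GPU the two passes would typically run in parallel with atomic counters; the sequential loops here only show the data flow from counts to offsets to the flat index buffer.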

2. Two-Stage Sorting Technique

LucidRaster’s fundamental advance is its two-stage sorting strategy for transparent fragment ordering:

Block-Level Sorting: Within each 8×8 pixel block, a fixed thread subgroup builds a list of overlapping triangle "block-rows", projecting them to "tri-blocks". Each tri-block's centroid depth is precomputed and compactly encoded with its index. A shared-memory bitonic sort (complexity $O(b \log b)$, $b \leq 256$ for the low-rasterizer) ensures in-block fragments are ordered by depth for subsequent processing.

Per-Pixel Depth-Filter: After the block sort, each 8×4 half-block is processed per pixel, with fragments (“samples”) visited in block-sorted order. Each thread maintains a local min-heap "depth-filter" of capacity $k$ (default $k=3$) and composes samples incrementally in front-to-back order. Whenever an insertion pushes the filter past $k$ entries, the frontmost sample is removed and composed, and the filter continues with the next segment. At the end, the remaining samples are flushed in depth order and the composite color is output.
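
The block-level stage can be illustrated with a packed sort key and a bitonic sorting network. The sketch below is a sequential Python stand-in for the shared-memory version, under assumptions not confirmed by the source: depths are non-negative 32-bit floats, the index fits in 8 bits (matching $b \leq 256$), and `pack_key` and `bitonic_sort` are hypothetical names.

```python
import struct


def pack_key(depth, index, index_bits=8):
    """Pack a non-negative float depth and a small index into one integer key."""
    # For non-negative IEEE-754 floats, the raw bit pattern orders like the value,
    # so sorting the packed integers sorts by depth, with the index as tie-breaker.
    depth_bits = struct.unpack("<I", struct.pack("<f", depth))[0]
    return (depth_bits << index_bits) | index


def bitonic_sort(keys):
    """Sort keys in place with a bitonic network; len(keys) must be a power of two."""
    n = len(keys)
    assert n & (n - 1) == 0, "pad the list to a power of two first"
    size = 2
    while size <= n:
        stride = size // 2
        while stride >= 1:
            for i in range(n):  # on a GPU, each i would be one thread
                j = i ^ stride
                if j > i:
                    ascending = (i & size) == 0
                    if (keys[i] > keys[j]) == ascending:
                        keys[i], keys[j] = keys[j], keys[i]
            stride //= 2
        size *= 2
    return keys
```

Because every compare-exchange pair is fixed in advance, the inner loop maps directly onto a thread-per-element layout, which is why this kind of network suits shared memory and subgroup shuffles. Lists shorter than a power of two can be padded with a maximal sentinel key.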

This process yields exact front-to-back alpha compositing:

$C_{\text{out}} = \sum_{i=1}^n \left(\alpha_i c_i \prod_{j=1}^{i-1}(1-\alpha_j)\right), \qquad \alpha_{\text{out}} = 1-\prod_{i=1}^n (1-\alpha_i)$

If $k$ is at least the number of fragments per pixel, the method is provably exact; $k=3$ suffices for $>99.5\%$ of pixels in tested scenes, with fallback strategies available if strict correctness is required (Jakubowski, 22 May 2024).
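
The depth-filter stage and the compositing formula can be sketched together. This is a minimal CPU illustration using Python's `heapq`, assuming samples arrive as `(depth, rgb, alpha)` tuples in block-sorted order; the function name and tuple layout are hypothetical, and the real implementation keeps the filter in registers rather than in a heap object.

```python
import heapq


def composite_with_depth_filter(samples, k=3):
    """Composite (depth, rgb, alpha) samples front to back through a k-entry min-heap."""
    heap = []
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # running product of (1 - alpha_j) over composed samples

    def compose(entry):
        nonlocal transmittance
        _, _, rgb, alpha = entry
        for c in range(3):
            color[c] += transmittance * alpha * rgb[c]
        transmittance *= 1.0 - alpha

    for seq, (depth, rgb, alpha) in enumerate(samples):
        # seq breaks depth ties so the heap never has to compare colors.
        heapq.heappush(heap, (depth, seq, rgb, alpha))
        if len(heap) > k:
            compose(heapq.heappop(heap))  # the frontmost sample leaves the filter

    while heap:  # flush the remaining samples in depth order
        compose(heapq.heappop(heap))

    return tuple(color), 1.0 - transmittance
```

For example, an input whose depths arrive as 0.2, 0.1, 0.4, 0.3, 0.5 is composed in fully sorted order with $k=3$, because no sample is more than a few positions away from its sorted slot. If a sample arrives after a farther one has already left the filter, the result is no longer exact, which is the overflow case the fallback strategies address.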

3. GPU Implementation and Data Layout

The implementation optimizes both data movement and parallel execution:

Data Storage: Separate buffers laid out for coalesced access hold the per-triangle attributes (normals, edge functions, depth coefficients) and the per-quad attributes (AABBs, vertex colors/UVs).
Bin Metadata: Metadata includes per-bin triangle and quad counts, exclusive prefix sums for offset computation, and a flat index buffer.
Scratch Buffers: Low-rasterizer workgroups utilize 64 KB global scratch space (up to 1024 tri-block-rows and tri-half-blocks), while high-rasterizer groups use up to 768 KB (for up to 128K).
Threading Model: Persistent threading per Gupta (2012) enables workgroups to iteratively fetch and process bins/batches. Shared-memory atomics manage prefix sums and allocations. Vulkan subgroup shuffles optimize data exchange for sorting.
Hardware-Specific Optimizations: Using quad primitives reduces packetization overhead. Subgroup size control (VK_EXT_subgroup_size_control) matches hardware lane widths. An early-alpha-stop ('alpha_threshold') lets the system skip occluded samples once accumulated opacity reaches $1 - \frac{1}{128}$, that is, once the remaining transmittance falls below $\frac{1}{128}$.
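
The early alpha stop can be sketched as a cutoff inside a front-to-back compositing loop. The example assumes the threshold is read as an accumulated-opacity level of $1 - \frac{1}{128}$ (equivalently, remaining transmittance of at most $\frac{1}{128}$); the function name and the returned `used` counter are hypothetical and exist only to make the effect visible.

```python
ALPHA_THRESHOLD = 1.0 - 1.0 / 128.0  # assumed reading: opacity level that ends compositing


def composite_front_to_back(sorted_samples, alpha_threshold=ALPHA_THRESHOLD):
    """Composite depth-sorted (rgb, alpha) samples, stopping once the pixel is nearly opaque."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0
    used = 0
    for rgb, alpha in sorted_samples:
        if 1.0 - transmittance >= alpha_threshold:
            break  # everything behind contributes less than 1/128 of its color
        for c in range(3):
            color[c] += transmittance * alpha * rgb[c]
        transmittance *= 1.0 - alpha
        used += 1
    return tuple(color), 1.0 - transmittance, used
```

With 20 stacked samples of alpha 0.5, the loop stops after 7 of them, since the accumulated opacity is then exactly $1 - 2^{-7}$. The skipped samples are bounded in their contribution by the remaining transmittance, which is why the cutoff is visually safe.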

4. Computational Complexity and Memory Usage

Let $T$ be the number of triangles, $B$ the number of bins, $D_b$ the depth complexity in bin $b$, $M$ the maximum block size ($\leq 256$), and $S$ the total number of screen samples. The total time per frame is bounded by:

$O(T)~(\text{setup}) + O(T+B)~(\text{binning}) + \sum_{b=1}^B \left[O(D_b)~(\text{samples}) + O \left(\frac{D_b}{M} \cdot M \log M \right)~(\text{block sorts}) + O(D_b k \log k)~(\text{pixel filtering})\right]$

Assuming uniform complexity, this reduces to:

$O(T + S\log M + S k\log k)$
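
One way to read this reduction, assuming $D_b$ counts the samples generated in bin $b$ so that $\sum_{b=1}^{B} D_b = \Theta(S)$, and assuming the $O(B)$ binning term is dominated by the other terms, is to sum each per-bin term separately:

$\sum_{b=1}^{B} O(D_b) = O(S), \qquad \sum_{b=1}^{B} O\left(\frac{D_b}{M} \cdot M \log M\right) = O(S \log M), \qquad \sum_{b=1}^{B} O(D_b k \log k) = O(S k \log k)$

With the values quoted earlier ($M = 256$, $k = 3$), the per-sample factors are small constants: $\log_2 M = 8$ and $k \log_2 k \approx 4.75$. Under these assumptions the frame cost grows roughly linearly in $T$ and $S$.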

Memory footprint comprises $O(T)$ for geometry, $O(B)$ for bin metadata, and per-workgroup scratch. Peak usage for 12 million triangle scenes is approximately 1.2 GB on high-end GPUs (Jakubowski, 22 May 2024).

5. Empirical Performance and Comparison

The table below compares LucidRaster's software rasterization time (SW) with hardware alpha blending (HW) on the tested scenes:

Scene	Samples (M)	SW (µs)	HW (µs)	SW/HW
Boxes	12.1	76	17.9	4.3×
Bunny	0.74	12.3	3.3	3.7×
Conference	5.3	86	16.8	5.1×
Dragon	0.67	4.2	1.25	3.4×
White Oak	122.0	58.0	27.0	2.1×
Average	—	—	—	3.3×

For context, the cost of each method relative to hardware alpha blending (baseline 1.0×; lower is faster) is:

WBOIT: 1.17×
MBOIT (6 moments): 3.13×
MLAB (4 layers): 4.75×
LucidRaster: 3.30×

The alpha-threshold early-out improves performance by up to 25% in heavily occluded scenes. Frame time breaks down as 14% setup, 5% binning, 60% low-rasterization, and 18% high-rasterization (Jakubowski, 22 May 2024).

6. Extensions, Limitations, and Future Directions

Memory Usage: High memory consumption may be mitigated with dynamic allocation or sparse buffer strategies, potentially reducing usage by 2–4×.
Depth-Filter Parameter: The default $k$ trades register pressure for accuracy; a fallback to an exact peel can guarantee complete correctness for the rare overflow cases ($<0.2\%$ of pixels).
Multisample Anti-Aliasing (MSAA): Not integrated in the current design; supporting, e.g., 4× or 16× MSAA would require proportional increases in metadata.
Adaptive Binning: Dynamic bin sizing (e.g., 64×64 in sparse, 16×16 in dense regions) and concurrent use of low/high rasterizers are proposed to optimize parallel occupancy.

7. Relevance to Compact Raster Time Series

LucidRaster’s exact, bin-based transparent compositing is complementary to advances in compact raster representations for temporal data. The k³-tree approach (as in “A Compact Representation of Raster Time Series” (Cruces et al., 2019)) supports efficient spatiotemporal queries and highly compressed storage by leveraging spatial and temporal locality in multi-layer raster data. Integration of k³-trees could enable LucidRaster to:

Store sequences of transparency layers or complex scenes with near-optimal bit-compactness.
Offer $O(\log_k N + \log_k \sigma + \log_k T)$ query time for direct spatio-temporal access without decompressing entire rasters.
Scale efficiently for applications in domains such as weather, satellite image analysis, or film post-production, where high temporal resolution and large spatial extents are commonplace (Cruces et al., 2019).

This suggests a unified approach where LucidRaster's compute-based sorting and k³-tree-accelerated storage/indexing can together provide exact transparency compositing and highly scalable, query-efficient access for time-varying raster datasets.

References

Jakubowski (2024). LucidRaster: GPU Software Rasterizer for Exact Order-Independent Transparency.

Cruces et al. (2019). A Compact Representation of Raster Time Series.
