8-Wide BVH Ray Stream Implementation

Updated 30 June 2025

8-Wide BVH Ray Stream Implementation is a method that uses quantized BVH structures, fixed-point arithmetic, and ray stream traversal to enhance ray tracing performance.
It groups rays into streams that share traversal stacks to reduce memory traffic, and it relies on conservative rounding to preserve geometric correctness.
This approach cuts memory traffic to just 18% of traditional BVH methods, making it ideal for bandwidth-limited and SIMD-optimized hardware environments.

An 8-wide BVH (Bounding Volume Hierarchy) ray stream implementation is a memory-efficient, SIMD-friendly approach to accelerating ray tracing workloads, especially in environments where memory bandwidth is a limiting factor. It integrates quantized BVH and primitive representations, stream-based traversal (processing groups of rays that share traversal state), and fixed-point arithmetic for intersection routines. This design is particularly amenable to modern hardware, enabling significant reductions in memory traffic—down to 18% of conventional BVH approaches with 8-bit quantization—while robustly handling geometric correctness and throughput requirements.

1. Quantized BVH and Primitive Structures

The 8-wide BVH structure consists of nodes with up to eight children, where all bounding box coordinates and primitive (triangle) vertices are quantized to 8-bit fixed-point representations in local coordinate frames.

Node Quantization:
- Each node defines a local grid: origin (full-precision integer) and per-axis scale factors (power-of-two, 8 bits).
- Child bounding boxes are stored as 8-bit unsigned integers in node-local space:
$scale_{axis} = \left\lceil \log_2 \frac{maxBounds_{axis} - minBounds_{axis}}{2^{8} - 1} \right\rceil$

$lo_{axis} = \left\lfloor \frac{p_{lo, axis} - origin_{axis}}{2^{scale_{axis}}} \right\rfloor, \quad hi_{axis} = \left\lceil \frac{p_{hi, axis} - origin_{axis}}{2^{scale_{axis}}} \right\rceil$
- Leaf node triangles are similarly quantized to the local frame using 8 bits per coordinate.
Storage Efficiency:
- Node bounds: 48 bytes (quantized) vs. 192+ bytes (float).
- Compressed node: 96 bytes vs. 228 bytes (uncompressed).
- Triangle: 9 bytes (quantized) vs. 36 bytes (uncompressed).

This quantization maps both parent-child bounding box relationships and leaf primitives to a uniform, compact representation. Quantization is designed to strictly avoid geometric cracks by enforcing conservative rounding rules; for leaf triangles, the maximal quantization gap across the BVH is used globally to maintain watertight containment.
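The node quantization formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the source implementation: the function names `quantize_children` and `dequantize` are hypothetical, the origin is taken to be the parent's minimum corner, and floats stand in for the full-precision integer origin described above.

```python
import math

def quantize_children(parent_lo, parent_hi, children, bits=8):
    """Quantize child AABBs onto a node-local grid (illustrative sketch).

    parent_lo, parent_hi: 3-tuples bounding all children.
    children: list of (lo, hi) pairs of 3-tuples.
    Returns (origin, scale_exponents, quantized_boxes).
    """
    levels = (1 << bits) - 1          # 255 grid steps for 8 bits
    origin = parent_lo                # assumption: origin is the parent's min corner
    scales = []
    for a in range(3):
        extent = parent_hi[a] - parent_lo[a]
        # Smallest power-of-two cell size that covers the extent in `levels` cells.
        scales.append(math.ceil(math.log2(extent / levels)) if extent > 0 else 0)
    quantized = []
    for lo, hi in children:
        # Conservative rounding: floor the lower bound, ceil the upper bound.
        qlo = tuple(math.floor((lo[a] - origin[a]) / 2.0 ** scales[a]) for a in range(3))
        qhi = tuple(math.ceil((hi[a] - origin[a]) / 2.0 ** scales[a]) for a in range(3))
        quantized.append((qlo, qhi))
    return origin, scales, quantized

def dequantize(origin, scales, qbox):
    """Map a quantized box back to world space, for checking containment."""
    qlo, qhi = qbox
    lo = tuple(origin[a] + qlo[a] * 2.0 ** scales[a] for a in range(3))
    hi = tuple(origin[a] + qhi[a] * 2.0 ** scales[a] for a in range(3))
    return lo, hi

# Usage: one child box inside a 10 x 4 x 1 parent.
origin, scales, boxes = quantize_children(
    (0.0, 0.0, 0.0), (10.0, 4.0, 1.0),
    [((1.03, 0.5, 0.2), (7.77, 3.9, 0.91))],
)
```

Because the lower bound is floored and the upper bound is ceiled, the dequantized box can only grow relative to the original, which is the conservative behavior the text relies on to avoid cracks.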

2. Ray Stream Tracing and Shared Stack Organization

Instead of each ray maintaining a standalone traversal stack, rays are dynamically grouped as "streams" that share traversal state. Each stack entry contains:

The current BVH node index.
A list (array) of rays assigned to that node.

Traversal proceeds as follows:

Pop a node + ray list from the shared stack.
Load the node's data once; test intersection for all rays in the list (utilizing SIMD-friendly routines).
For each intersected child, partition rays (using SIMD masks), and push new node + corresponding ray sublists onto the stack.
On reaching a leaf, rays in the group process intersection with quantized primitives.
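The traversal steps above can be sketched as follows. This is a simplified, scalar illustration under stated assumptions: `Node`, `make_ray`, `hits_box`, and `trace_stream` are hypothetical names, the box test uses ordinary floating-point slabs rather than the fixed-point routine, a Python list comprehension stands in for SIMD mask partitioning, and leaves simply record candidate primitives instead of intersecting them.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Each child entry is (child_index, box_lo, box_hi); child boxes live in the parent.
    children: list = field(default_factory=list)
    prims: list = field(default_factory=list)  # primitive ids; non-empty for leaves

def make_ray(origin, direction):
    # Assumes every direction component is nonzero, to keep the slab test simple.
    return origin, tuple(1.0 / d for d in direction)

def hits_box(ray, lo, hi):
    """Floating-point slab test (placeholder for the fixed-point version)."""
    origin, inv_dir = ray
    tmin, tmax = 0.0, float("inf")
    for a in range(3):
        t0 = (lo[a] - origin[a]) * inv_dir[a]
        t1 = (hi[a] - origin[a]) * inv_dir[a]
        if t0 > t1:
            t0, t1 = t1, t0
        tmin, tmax = max(tmin, t0), min(tmax, t1)
    return tmin <= tmax

def trace_stream(nodes, rays, root=0):
    """Traverse with one shared stack of (node index, ray id list) entries.

    Returns ({ray_id: [candidate primitive ids]}, number of node fetches).
    """
    candidates = {i: [] for i in range(len(rays))}
    fetches = 0
    stack = [(root, list(range(len(rays))))]
    while stack:
        node_idx, ray_ids = stack.pop()
        node = nodes[node_idx]        # node data is loaded once for the whole group
        fetches += 1
        if node.prims:                # leaf: hand the group to primitive intersection
            for r in ray_ids:
                candidates[r].extend(node.prims)
            continue
        for child_idx, lo, hi in node.children:
            # Partition: keep only the rays that hit this child's box.
            sub = [r for r in ray_ids if hits_box(rays[r], lo, hi)]
            if sub:
                stack.append((child_idx, sub))
    return candidates, fetches
```

With three rays and a root that has two leaf children, this loop fetches three nodes in total, whereas per-ray traversal would fetch the root once for every ray.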

In this model, ray origins and directions are compressed as well. For instance, directions may use 32-bit octahedral encoding, and data structures are optimized for coalesced memory access.
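For illustration, a 32-bit octahedral direction encoding could look like the sketch below. The split into two 16-bit components and the names `oct_encode` and `oct_decode` are assumptions made here; the source only states that directions may use 32-bit octahedral encoding.

```python
import math

def _sign(v):
    return 1.0 if v >= 0.0 else -1.0

def oct_encode(d):
    """Pack a nonzero direction into 32 bits (16 bits per octahedral coordinate)."""
    x, y, z = d
    s = abs(x) + abs(y) + abs(z)      # project onto the octahedron |x|+|y|+|z| = 1
    u, v = x / s, y / s
    if z < 0.0:
        # Fold the lower hemisphere over the diagonals of the unit square.
        u, v = (1.0 - abs(v)) * _sign(u), (1.0 - abs(u)) * _sign(v)
    qu = round((u * 0.5 + 0.5) * 65535)
    qv = round((v * 0.5 + 0.5) * 65535)
    return (qu << 16) | qv

def oct_decode(code):
    """Unpack a 32-bit code into a unit-length direction."""
    u = ((code >> 16) & 0xFFFF) / 65535 * 2.0 - 1.0
    v = (code & 0xFFFF) / 65535 * 2.0 - 1.0
    z = 1.0 - abs(u) - abs(v)
    if z < 0.0:
        u, v = (1.0 - abs(v)) * _sign(u), (1.0 - abs(u)) * _sign(v)
    n = math.sqrt(u * u + v * v + z * z)
    return (u / n, v / n, z / n)
```

The encoding is lossy: a decoded direction differs slightly from the input, so a conservative traversal has to account for that error elsewhere.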

Stack and Traversal Traffic

The shared stack approach greatly reduces stack memory traffic, as each BVH node's data is fetched only once per ray group. The amortized cost of stack and node fetches shrinks in inverse proportion to the stream size.
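As a rough illustration of this amortization (a simplified model introduced here, not a formula from the source), let $B_{node}$ be the node size in bytes and $N$ the number of rays in the group. The node-fetch traffic attributed to each ray for one node visit is then approximately:

$T_{node} \approx \frac{B_{node}}{N}$

For the 96-byte compressed node, this share falls from 96 bytes per ray at $N = 1$ to 12 bytes per ray at $N = 8$. The model ignores the ray list carried in each stack entry, whose size grows with $N$.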

3. Fixed-Point Arithmetic in Node and Primitive Traversal

All intersection computations—including node bounding boxes and triangles—are performed directly in fixed-point integer arithmetic, using the same 8-bit quantized coordinates. This methodology addresses several numerical and implementation challenges:

Geometric Correctness: Fixed-point computations with carefully chosen rounding rules avoid geometric cracks associated with floating-point rounding, especially at grid boundaries.
Arithmetic Operations: Addition, subtraction, and multiplication widen the result format enough to avoid overflow. Writing $R.Q$ for a fixed-point format with $R$ integer bits and $Q$ fractional bits, the rules are:
- Addition/subtraction: $(R_1.Q_1) \pm (R_2.Q_2) \rightarrow \big((\max\{R_1, R_2\} + 1).\max\{Q_1, Q_2\}\big)$
- Multiplication: $(R_1.Q_1) \cdot (R_2.Q_2) \rightarrow \big((R_1 + R_2).(Q_1 + Q_2)\big)$
Intersection Routines: The slabs-based ray-box intersection and barycentric (edge function) based ray-triangle test are rewritten in fixed-point, matching quantization and robustly avoiding geometric holes.

No intermediate decompression to floating-point is required throughout the entire traversal and intersection process.
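The bit-width rules and an integer-only slab test can be sketched as below. This is an illustration under assumptions, not the source's routine: the helper names, the choice of 8 fractional bits for the ray origin and reciprocal direction, and the use of a precomputed reciprocal are all hypothetical, and the sketch does not model conservative rounding of that reciprocal.

```python
def add_format(a, b):
    """Result format (R, Q) of adding or subtracting two fixed-point values."""
    return (max(a[0], b[0]) + 1, max(a[1], b[1]))

def mul_format(a, b):
    """Result format (R, Q) of multiplying two fixed-point values."""
    return (a[0] + b[0], a[1] + b[1])

def to_fixed(x, q):
    """Convert a real number to an integer with q fractional bits."""
    return round(x * (1 << q))

def slab_test_fixed(o, inv_d, qlo, qhi, q_o=8):
    """Integer-only ray/box slab test in node-local grid units.

    o:     ray origin, integers with q_o fractional bits.
    inv_d: reciprocal direction, nonzero integers sharing one fractional bit count.
    qlo, qhi: 8-bit quantized box bounds (integer grid coordinates).
    """
    tmin, tmax = 0, None
    for a in range(3):
        # Shift the 8-bit bound to the origin's format before subtracting.
        t0 = ((qlo[a] << q_o) - o[a]) * inv_d[a]
        t1 = ((qhi[a] << q_o) - o[a]) * inv_d[a]
        if t0 > t1:
            t0, t1 = t1, t0
        tmin = max(tmin, t0)
        tmax = t1 if tmax is None else min(tmax, t1)
    return tmin <= tmax

# Usage: a ray travelling mostly along +z through a box spanning cells 10..20.
origin = tuple(to_fixed(c, 8) for c in (15.0, 15.0, 0.0))
inv_dir = tuple(to_fixed(1.0 / c, 8) for c in (0.01, 0.01, 1.0))
hit = slab_test_fixed(origin, inv_dir, (10, 10, 10), (20, 20, 20))
```

Python integers never overflow, so `add_format` and `mul_format` serve here only to show how wide a hardware datapath would need to be: subtracting an 8.8 origin from an 8.0 bound yields a 9.8 value, and multiplying that by a 2.14 reciprocal yields 11.22.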

4. Memory Traffic and Performance Analysis

The combination of quantization, ray stream traversal, and fixed-point arithmetic has a pronounced effect on bandwidth and throughput.

Traffic Reduction: With 8-bit quantization, memory traffic for the fully compressed, stream-based BVH8 implementation averages 18% of the traffic required by traditional, per-ray, uncompressed BVH8 approaches. For the Sponza scene the figure is 25% (731 MiB versus 2905 MiB).
Node and Primitive Size: BVH node size is halved or better (from 228B to 96B); triangle storage is reduced 4× (from 36B to 9B).
Intersection Test Volume: Quantization can slightly increase the number of ray-box and ray-triangle intersection tests due to conservative AABB expansion, but this is offset by the dramatic savings in node fetch and stack traffic.
Visual Quality: Edge artifacts are minimal and occur principally with highly aggressive quantization and very small triangles; they are perceptually negligible in typical scenes and settings.

Method	Total Memory Traffic (MiB, Sponza)	Relative Traffic
Traditional BVH8	2905	100%
8-wide Quantized + Streams	731	25% (18% avg)

5. Implications for Hardware and SIMD Utilization

The 8-wide BVH structure matches the lane count of common SIMD configurations (e.g., the eight 32-bit lanes of a 256-bit AVX2 register, half of an AVX-512 register, or custom ray-tracing cores), maximizing core utilization and memory throughput. Benefits are notable for:

Bandwidth-constrained platforms: Mobile GPUs, integrated graphics, and ASICs.
Power and Area Efficiency: Reduced buffer and arithmetic unit size via quantization and fixed-point operations.
Compatibility: The approach is API-transparent and can be integrated into modern ray tracing frameworks, e.g., via compressed BVH traversal kernels in Vulkan or DirectX.

Ray stream tracing and quantized arithmetic are synergistic with wide node BVHs, as the parallelism and amortization of both stack and node fetches directly leverage hardware parallel capabilities.

6. Geometry Integrity and Robustness

By enforcing a rigorous quantization and rounding policy:

All child AABBs and triangle endpoints are strictly contained within their parents, guaranteeing watertightness throughout traversal.
Fixed-point mathematics mitigates the geometric holes that can arise from floating-point imprecision near shared triangle edges.
Propagated scale factors ensure that adjacent triangles across leaves remain grid-aligned, preventing visible cracks.

7. Application Scope and Suitability

This methodology is especially suitable for:

Large, high-complexity scenes rendered on memory-bandwidth-limited devices.
Power-constrained platforms (e.g., mobile graphics or hardware-accelerated custom ray tracing cores) requiring maximized throughput per byte.
Wide SIMD or SIMT architectures that benefit from bulk ray processing.

Although some increase in intersection count is possible with highly aggressive quantization, practical quality is preserved, and the consistency of the fixed-point approach ensures robust geometric integrity.

Aspect	Traditional BVH8	8-Wide Quantized + Ray Stream
BVH Node/Primitive Size	228B/36B	96B/9B
Stack Usage	Per-ray (high traffic)	Shared (low/aggregate traffic)
Intersection Arithmetic	Floating-point	Fixed-point
Memory Traffic	Baseline (100%)	18% on average (25% for Sponza)
SIMD Suitability	Moderate	High (8-wide / Batch-oriented)

This approach substantially increases the efficiency, scalability, and correctness of BVH-based ray tracing, particularly in settings where resource constraints or throughput per bandwidth are prime concerns. All quantitative metrics, algorithmic details, formulas, and methodology are adopted directly from the source research.
