GPU RSF Level Set Framework
- The paper introduces a GPU-accelerated RSF framework that reformulates traditional level set methods into local computations for efficient segmentation of terabyte-scale biological datasets.
- It employs a dual-level parallel scheme combining volume-level SPMD and voxel-level SIMD to partition and evolve high-resolution 3D imaging volumes using CUDA kernels.
- Performance profiling reveals speedups between 100× and 400× over CPU implementations while maintaining high segmentation accuracy and minimal parameter retuning.
The GPU RSF (Region-Scalable Fitting) Level Set Framework is a computational architecture for large-scale semantic segmentation of microvascular structures in three-dimensional biological imaging modalities. By reformulating the traditional RSF level set method—notably suited for modeling near-diffraction-limit network morphologies—so that all core computations are inherently local, the framework enables the entire evolution process to be implemented as a series of data-parallel GPU kernels. This approach specifically addresses the impracticality of conventional level set segmentation over terabyte-scale image volumes, leveraging both single-instruction multiple-data (SIMD) and single-program multiple-data (SPMD) parallelism to achieve substantial speedups and scalability across diverse high-resolution datasets (Niger et al., 2024).
1. RSF Energy Functional and Evolution Equations
The core of the framework is the RSF energy functional, which integrates regularization, smoothing, and region-scalable fitting force terms for a signed distance level-set function $\phi$ and image intensity $I$. Over the image domain $\Omega \subset \mathbb{R}^3$, the energy is expressed as

$$
E(\phi) = \mu\,\mathcal{P}(\phi) + \nu\,\mathcal{L}(\phi) + \mathcal{E}_{\mathrm{RSF}}(\phi),
$$

where:
- $\mathcal{P}(\phi) = \int_\Omega \tfrac{1}{2}\left(|\nabla\phi| - 1\right)^2 d\mathbf{x}$: distance regularization, keeping $\phi$ close to a signed distance function,
- $\mathcal{L}(\phi) = \int_\Omega \delta_\varepsilon(\phi)\,|\nabla\phi|\, d\mathbf{x}$: smoothing via the length of the zero level set,
- $\mathcal{E}_{\mathrm{RSF}}(\phi) = \sum_{i \in \{+,-\}} \lambda_i \int_\Omega \int_\Omega K_\sigma(\mathbf{x}-\mathbf{y})\,\left|I(\mathbf{y}) - f_i(\mathbf{x})\right|^2 M_i(\phi(\mathbf{y}))\, d\mathbf{y}\, d\mathbf{x}$: region-scalable fitting force, with local fitted intensities $f_\pm$ inside/outside the zero level set, membership functions $M_+ = H_\varepsilon(\phi)$ and $M_- = 1 - H_\varepsilon(\phi)$ parametrized by a smooth Heaviside function $H_\varepsilon$, and a Gaussian kernel $K_\sigma$ realizing multi-scale Gaussian convolutions.
The evolution of $\phi$ is driven by gradient descent on $E$, yielding the update

$$
\frac{\partial\phi}{\partial t} = \mu\left(\nabla^2\phi - \operatorname{div}\!\left(\frac{\nabla\phi}{|\nabla\phi|}\right)\right) + \nu\,\delta_\varepsilon(\phi)\,\operatorname{div}\!\left(\frac{\nabla\phi}{|\nabla\phi|}\right) - \delta_\varepsilon(\phi)\left(\lambda_+ r_+ - \lambda_- r_-\right),
$$

where $r_\pm(\mathbf{x}) = \int_\Omega K_\sigma(\mathbf{y}-\mathbf{x})\,\left|I(\mathbf{x}) - f_\pm(\mathbf{y})\right|^2 d\mathbf{y}$. The intermediate variations are calculated from $\delta_\varepsilon = H_\varepsilon'$ and the product rule, as required by Euler–Lagrange analysis.
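The smoothed Heaviside $H_\varepsilon$ and its derivative $\delta_\varepsilon$ appear in every voxel update. A minimal sketch of how the common arctangent-based forms might be written as branch-free CUDA device functions (the function names and `eps` parameterization are illustrative, not the paper's exact code):

```cuda
// H_eps(x) = 1/2 * (1 + (2/pi) * atan(x/eps)), a smooth step in [0,1];
// delta_eps(x) = eps / (pi * (x^2 + eps^2)) is its derivative.
// Both are branch-free, so warps never diverge on the sign of phi.
__device__ __forceinline__ float heavisideEps(float x, float eps) {
    const float kInvPi = 0.318309886f;  // 1/pi
    return 0.5f + kInvPi * atanf(x / eps);
}

__device__ __forceinline__ float deltaEps(float x, float eps) {
    const float kInvPi = 0.318309886f;
    return kInvPi * eps / fmaf(x, x, eps * eps);  // fmaf computes x*x + eps*eps
}
```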
2. Discretization and Numerical Implementation
Discretization is conducted on a uniform grid with spacing $h$. Central-difference stencils are utilized for both the Laplacian and gradients:

$$
\left(\frac{\partial\phi}{\partial x}\right)_{i,j,k} \approx \frac{\phi_{i+1,j,k} - \phi_{i-1,j,k}}{2h},
\qquad
(\nabla^2\phi)_{i,j,k} \approx \frac{\phi_{i+1,j,k} + \phi_{i-1,j,k} + \phi_{i,j+1,k} + \phi_{i,j-1,k} + \phi_{i,j,k+1} + \phi_{i,j,k-1} - 6\,\phi_{i,j,k}}{h^2},
$$

with $|\nabla\phi|$ calculated as the Euclidean norm of the axis derivatives. The update $\phi^{n+1} = \phi^{n} + \Delta t\,(\partial\phi/\partial t)$ is applied at each time step, incorporating the stencil-derived terms, local RSF forces, and regularization. Neumann boundary conditions are enforced using ghost-cell mirroring at the volume edges.
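A minimal CUDA sketch of this stencil computation, with ghost cells emulated by index mirroring (the flat memory layout, launch shape, and helper names are assumptions rather than the paper's exact implementation):

```cuda
// Mirror an index into [0, n) to emulate ghost-cell Neumann boundaries.
__device__ __forceinline__ int mirror(int i, int n) {
    if (i < 0)  return -i;              // reflect below the lower edge
    if (i >= n) return 2 * n - i - 2;   // reflect above the upper edge
    return i;
}

// Central-difference gradient magnitude and 7-point Laplacian of phi
// on a uniform grid with spacing h; one thread per voxel.
// Launch e.g. with dim3 block(8,8,8) and a grid covering (nx,ny,nz).
__global__ void gradientLaplacian(const float* phi, float* gradMag, float* lap,
                                  int nx, int ny, int nz, float h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= nx || y >= ny || z >= nz) return;

    auto at = [&](int i, int j, int k) {  // mirrored, flat x-fastest indexing
        return phi[(size_t)mirror(k, nz) * ny * nx
                 + (size_t)mirror(j, ny) * nx + mirror(i, nx)];
    };
    float c  = at(x, y, z);
    float dx = (at(x + 1, y, z) - at(x - 1, y, z)) / (2.0f * h);
    float dy = (at(x, y + 1, z) - at(x, y - 1, z)) / (2.0f * h);
    float dz = (at(x, y, z + 1) - at(x, y, z - 1)) / (2.0f * h);

    size_t idx = (size_t)z * ny * nx + (size_t)y * nx + x;
    gradMag[idx] = sqrtf(dx * dx + dy * dy + dz * dz);
    lap[idx] = (at(x + 1, y, z) + at(x - 1, y, z)
              + at(x, y + 1, z) + at(x, y - 1, z)
              + at(x, y, z + 1) + at(x, y, z - 1) - 6.0f * c) / (h * h);
}
```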
3. GPU Parallelization and Execution Model
Parallel execution is realized at two distinct levels:
- Volume-level SPMD: The full image volume (tens to hundreds of GB) is partitioned into overlapping cubic blocks sized to fit in device memory, each transferred to and processed independently on the available GPUs or streams. The overlap width matches the maximum Gaussian kernel radius to avoid boundary artifacts, and processed blocks are merged in the overlap regions using linear interpolation.
- Voxel-level SIMD: Within each block, every voxel’s update is independent, mapped directly to individual CUDA threads.
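As a minimal illustration of this voxel-to-thread mapping, the explicit Euler update might look as follows (the signature is schematic; curvature and delta-function factors are assumed to be folded into the precomputed force field):

```cuda
// One thread per voxel: explicit Euler step on a flattened block.
// lap and force are precomputed by the stencil and RSF kernels.
__global__ void updatePhi(const float* phi, const float* lap, const float* force,
                          float* phiNew, int nVoxels, float dt, float mu) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global voxel index
    if (i >= nVoxels) return;
    phiNew[i] = phi[i] + dt * (mu * lap[i] + force[i]);
}

// Launch with one thread per voxel of the block:
// updatePhi<<<(nVoxels + 255) / 256, 256>>>(d_phi, d_lap, d_force,
//                                           d_phiNew, nVoxels, dt, mu);
```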
Memory management involves allocating 3D arrays/textures for the raw image data and the evolving level-set function $\phi$, as well as all required intermediate structures (e.g., Heaviside-masked images, convolved fields, stencil terms), all in float32 format. Data transfers between host and device are managed with cudaMemcpyAsync and overlapped with computation (CUDA streams) to maximize throughput.
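A minimal sketch of this double-buffered transfer/compute overlap pattern (buffer names, the two-stream layout, and the per-block kernel launch are assumptions; host buffers must be page-locked, e.g. via cudaHostAlloc, for cudaMemcpyAsync to actually overlap with computation):

```cuda
#include <cuda_runtime.h>

// Alternate two streams so the copies for block b+1 overlap the kernels of block b.
// blockBytes = voxels-per-block * sizeof(float); hostI/hostPhi point to pinned memory.
void processBlocks(float** hostI, float** hostPhi, int nBlocks, size_t blockBytes) {
    cudaStream_t stream[2];
    float *dI[2], *dPhi[2];
    for (int s = 0; s < 2; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc(&dI[s], blockBytes);
        cudaMalloc(&dPhi[s], blockBytes);
    }
    for (int b = 0; b < nBlocks; ++b) {
        int s = b & 1;  // stream ordering serializes reuse of dI[s]/dPhi[s]
        cudaMemcpyAsync(dI[s],   hostI[b],   blockBytes, cudaMemcpyHostToDevice, stream[s]);
        cudaMemcpyAsync(dPhi[s], hostPhi[b], blockBytes, cudaMemcpyHostToDevice, stream[s]);
        // ... launch the evolution kernels for this block on stream[s] ...
        cudaMemcpyAsync(hostPhi[b], dPhi[s], blockBytes, cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < 2; ++s) {
        cudaFree(dI[s]); cudaFree(dPhi[s]); cudaStreamDestroy(stream[s]);
    }
}
```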
A representative CUDA-style pseudocode illustrates the block-wise workflow:
```
// On CPU: partition full_volume into subvolumes[0..M-1]
for each subvol in subvolumes:
    cudaMemcpyAsync(device.I_tex,   subvol.I,   stream[subvol])  // host -> device
    cudaMemcpyAsync(device.phi_tex, subvol.phi, stream[subvol])
sync all streams
for each subvol in subvolumes:
    evolveBlock<<<grid, block, 0, stream[subvol]>>>(
        device.I_tex, device.phi_tex, parameters);
    // kernel calls inside evolveBlock:
    // 1) heavisideAndMask<<<...>>>(phi, I, H_minus_I, H_plus_I)
    // 2) gaussianConv3D<<<...>>>(H_minus_I, K_H_minus_I)
    // 3) gaussianConv3D<<<...>>>(H_plus_I,  K_H_plus_I)
    // 4) compute_rPlusMinus<<<...>>>(K_H_minus_I, K_H_plus_I, r_minus, r_plus)
    // 5) compute_force<<<...>>>(I, r_minus, r_plus, Force)
    // 6) gradientLaplacian<<<...>>>(phi, Grad, Lap)
    // 7) updatePhi<<<...>>>(phi, Grad, Lap, Force, phi_new)
    cudaMemcpyAsync back phi_new to host                         // device -> host
// Host merges phi_new blocks with linear interpolation in overlap
```
4. Algorithmic and Hardware Optimizations
Key optimizations include:
- Volume-level culling: Skipping blocks where contours have converged or no seed points exist.
- Minimized curtain overlap: Overlap regions are restricted to the length scale of the largest Gaussian convolution, reducing redundant computation.
- Hardware-specific adjustments: Read-only accesses to $I$ and $\phi$ via CUDA texture memory enable hardware-accelerated linear interpolation and caching. Shared-memory tiling is employed for efficient block-wise 3D Gaussian convolutions, performed as three sequential 1D operations to minimize memory bandwidth usage. Non-branching computation of the smooth Heaviside and delta functions is achieved via direct `atan` and `fma` calls.
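A minimal sketch of one such separable pass along x with shared-memory tiling (the tile width, constant-memory tap storage, and clamped boundary handling are illustrative assumptions; the y and z passes are analogous with strided reads):

```cuda
#define TILE 128
#define MAX_RADIUS 16
// Normalized Gaussian taps, uploaded once from the host with cudaMemcpyToSymbol.
__constant__ float c_kernel[2 * MAX_RADIUS + 1];

// One 1D convolution pass along x; launch with dim3 block(TILE) and
// dim3 grid((nx + TILE - 1) / TILE, ny, nz).
__global__ void gaussianConv1DX(const float* in, float* out,
                                int nx, int ny, int nz, int radius) {
    __shared__ float tile[TILE + 2 * MAX_RADIUS];
    int y = blockIdx.y, z = blockIdx.z;
    int x0 = blockIdx.x * TILE;  // first x handled by this thread block
    const float* row = in + ((size_t)z * ny + y) * nx;

    // Cooperative load of the tile plus halos, clamping at volume edges.
    for (int i = threadIdx.x; i < TILE + 2 * radius; i += blockDim.x) {
        int x = min(max(x0 + i - radius, 0), nx - 1);
        tile[i] = row[x];
    }
    __syncthreads();

    int x = x0 + threadIdx.x;
    if (threadIdx.x < TILE && x < nx) {
        float acc = 0.0f;
        for (int k = -radius; k <= radius; ++k)
            acc = fmaf(c_kernel[k + radius], tile[threadIdx.x + radius + k], acc);
        out[((size_t)z * ny + y) * nx + x] = acc;
    }
}
```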
5. Performance Profiling and Quantitative Results
Performance is quantified per kernel and overall block segmentation:
- On an RTX 2080 Ti GPU, each major kernel (e.g., convolution, gradient, Laplacian calculation) achieves speedups from 92× (Heaviside-masked images) up to 360× (gradient, Laplacian) over a single-core Intel Xeon CPU.
- End-to-end segmentation of a single block converges in seconds on the GPU, whereas CPU execution requires minutes.
- The overall time-step speedup ranges between 100× and 400×, attributed to the data-parallel reformulation and memory optimizations.
A sample profiling table:
| Operation | GPU time | Share of GPU time | Speed-up (CPU/GPU) |
|---|---|---|---|
| H−I | 41 ms | 7.0% | 92.6× |
| H+I | 41 ms | 7.0% | 92.6× |
| K∗(H−I) | 58 ms | 10% | 281× |
| K∗(H+I) | 57 ms | 9.7% | 287× |
| ∇φ | 50 ms | 8.5% | 360× |
| ∇²φ | 53 ms | 9.1% | 358× |
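Per-kernel timings of this kind are commonly collected with CUDA events; a minimal sketch (the kernel and its arguments are placeholders, reusing the stencil sketch from Section 2):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
gradientLaplacian<<<grid, block>>>(d_phi, d_grad, d_lap, nx, ny, nz, h);
cudaEventRecord(stop);
cudaEventSynchronize(stop);  // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
printf("gradientLaplacian: %.2f ms\n", ms);
cudaEventDestroy(start);
cudaEventDestroy(stop);
```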
6. Applications, Modalities, and Scalability
The framework has been directly applied to diverse 3D imaging datasets:
- Light-sheet fluorescence microscopy (LSFM) of mouse ovary and brain (6.6 GB and 103.6 GB volumes, respectively).
- Micro-CT of mouse brains (10 μm isotropic voxels).
- Knife-edge scanning microscope (KESM) imaging of whole brain (micrometer-scale voxels, ~16 GB).
In all cases, the GPU RSF pipeline robustly produces sub-pixel-smooth segmentation results without parameter retuning. Performance in both segmentation accuracy (Dice > 0.85) and runtime (minutes on GPU vs. hours on CPU) exceeds that obtained via slice-by-slice thresholding.
Scalability bottlenecks include:
- GPU memory limits (16–32 GB per device), which restrict the block size that can be resident on a single device; larger datasets require multi-GPU strategies or CPU streaming.
- PCIe bandwidth: Transferring large memory buffers (e.g., 500 MB) incurs latencies of several hundred milliseconds, necessitating overlap of transfer and computation.
- Post-processing: Host-side merging of millions of subvolume blocks for whole-organ reconstructions requires efficient scatter/gather algorithms with linear blending in overlap regions, as sketched below.
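A minimal 1D sketch of such a blend across one overlap seam (the linear weight ramp and array layout are assumptions; the actual merge operates on 3D overlap slabs):

```cuda
// Blend the overlapping edges of two adjacent blocks: the weight ramps
// linearly from block A to block B across an overlap of width w.
void blendOverlap1D(const float* a, const float* b, float* out, int w) {
    for (int i = 0; i < w; ++i) {
        float t = (float)(i + 1) / (float)(w + 1);  // 0 < t < 1 across the overlap
        out[i] = (1.0f - t) * a[i] + t * b[i];
    }
}
```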
Potential future extensions include leveraging GPU direct RDMA for distributed storage access, multi-GPU tiling, and half-precision arithmetic in intermediate kernels, contingent on numerical stability.
7. Summary and Outlook
The GPU RSF Level Set Framework demonstrates an efficient and scalable solution for segmentation of high-topology, large-scale biological structures. Through the localization of all evolution terms to the voxel-neighborhood scale, and integration with CUDA-optimized kernel launches and memory management, the approach achieves substantial acceleration and broad modality generalization. The described implementation is immediately applicable for segmentation tasks across LSFM, micro-CT, and KESM imaging, handling terabyte-class data efficiently and requiring minimal user adjustment (Niger et al., 2024). This suggests that continued refinements in data partitioning, GPU–CPU streaming, and advanced memory layouts will further expand the applicability to even larger multi-organ or exascale biomedical volumes.