- The paper introduces EZ-SP, a GPU-native pipeline that learns semantic superpoints for rapid 3D segmentation.
- It employs a parallel, graph-based clustering algorithm using a compact deep backbone (<60k parameters) to overcome CPU bottlenecks.
- It achieves near-SOTA mIoU across diverse datasets, enabling real-time, scalable 3D segmentation for robotics and urban mapping.
EZ-SP: Fast and Lightweight Superpoint-Based 3D Segmentation
Introduction and Motivation
This work presents EZ-SP, a GPU-native, learnable, superpoint-based pipeline for efficient large-scale 3D semantic segmentation. Unlike prior superpoint methods that are bottlenecked by CPU-bound clustering steps and reliance on handcrafted features, EZ-SP learns geometrically and semantically coherent superpoints using a compact, deep backbone (sub-60k parameters) and a highly parallel, GPU-efficient clustering algorithm. The pipeline is explicitly designed to address the computational challenges posed by massive real-world point clouds in robotics, autonomous driving, and urban mapping, achieving scalable, real-time segmentation without sacrificing accuracy.
Pipeline Overview
EZ-SP decomposes the 3D semantic segmentation pipeline into three principal stages:
- Learned Embedding for Semantic Transition Detection: A sparse convolutional neural network maps points into a low-dimensional space where semantic transitions are accentuated through a contrastive surrogate loss. Embeddings are trained to sharply distinguish point pairs across semantic boundaries, facilitating later partitioning.
- GPU-Optimized Superpoint Partitioning: The key architectural advance is a graph-based, greedy, bottom-up clustering algorithm implemented with CUDA-backed operations. This approach eschews the fixed cluster count and sequential bottlenecks of k-means, instead using parallel edge merge gains to produce adaptive partitions with semantic and geometric coherence in a single device pass.
- Superpoint-Level Semantic Classification: Once partitioned, superpoints rather than individual points are classified via a lightweight transformer, reducing inference memory and compute by multiple orders of magnitude.
The full pipeline is visualized below following input embedding, parallel clustering, and superpoint semantic labeling steps:





Figure 1: EZ-SP pipeline workflow, illustrating input scene embedding, GPU-based clustering into semantically coherent superpoints, and superpoint-wise classification for rapid dense segmentation.
Partitioning Algorithmic Details
The partitioning step solves a contour-regularized energy minimization problem over the graph of point embeddings, where merges are greedily selected by highest merge gain:
$\Delta(P,Q) = -\frac{|P||Q|}{|P|+|Q|}\|\Feat_P - \Feat_Q\|^2 + \lambda \sum_{(p,q)\in(P \times Q)\cap\Edges} w_{p,q}$
EZ-SP leverages a parallel greedy merging scheme, executing maximal non-conflicting merges per iteration, with weakly connected components used to resolve merge chains efficiently. This enables rapid partitioning scaling linearly with point cloud size and GPU memory, contrasting sharply with the hour-long parameter sweeps and feature engineering endemic to CPU graph algorithms.
Hierarchical partitioning is trivially realized by recursive application, providing multi-scale superpoints. This multi-level structure further improves accuracy and efficiency in downstream segmentation.
Quantitative Outcomes
EZ-SP exhibits strong empirical results across three domains:
- Speed: Superpoint partitioning is over an order of magnitude faster than leading CPU-based techniques (e.g., PCP), and the full pipeline matches or exceeds LiDAR data acquisition rates (>1.3M points/s).
- Model Size: The partitioning backbone (<60k parameters) and classification head (<330k parameters) together require <2 MB of VRAM, permitting deployments on embedded or consumer-grade GPUs.
- Segmentation Accuracy: On S3DIS, KITTI-360, and DALES, EZ-SP achieves mIoU within 1–2 points of state-of-the-art models, matching or outperforming previous superpoint approaches even with minimal hyperparameter tuning. Its generalization holds for scenes ranging from indoor environments to aerial urban surveys.
A comparative throughput–accuracy–size analysis is presented below:





Figure 2: Inference speed versus mIoU versus model size on S3DIS: EZ-SP achieves near-SOTA accuracy with two orders of magnitude fewer parameters and unmatched throughput.
In extensive benchmarks, EZ-SP consistently produces semantically pure partitions (oracle mIoU) with highly efficient throughput, outperforming prior methods including VCCS, PCP, and differentiable k-means approaches both quantitatively and in qualitative partition adaptability.
Practical Implications and Applications
EZ-SP addresses several practical deployment hurdles:
- Real-Time Robotics: The GPU-native pipeline allows entire automotive LiDAR frames (∼1M points) to be segmentated in real-time on low-power embedded devices.
- Scalability: Processing entire building-scale or city-scale aerial point clouds in a single pass is enabled by the low memory footprint and lack of CPU-GPU bottlenecks.
- Versatility: The same EZ-SP model generalizes across diverse datasets without dataset-specific hyperparameter sweeps or feature engineering, simplifying deployment in heterogeneous sensing applications.
For large-scale urban mapping, AR/VR, and intelligent infrastructure, the pipeline facilitates continuous spatial understanding constrained only by hardware VRAM capacity.
Theoretical Implications and Future Directions
On a theoretical front, EZ-SP demonstrates that learning boundary-sensitive, spatially local embeddings can effectively replace geometric and radiometric handcrafted features for 3D oversegmentation. The modularity and parallelizability of the combinatorial partitioning strategy recommend it as a foundation for further development, including:
- Instance and Panoptic Segmentation: Extensions into non-semantic tasks using hierarchical superpoint structures should be straightforward.
- Self-Supervised or Weak Label Regimes: The edge-based surrogate loss makes unsupervised or weakly supervised partitioning feasible.
- Heterogeneous and Multimodal Fusion: Direct support for additional radiometric features and sensor types using the learnable backbone.
Continued work may focus on integrating temporal consistency for sequential LiDAR data, extending the pipeline to video-rate dynamic scene understanding.
Conclusion
EZ-SP provides a compact, learnable, fully GPU-based pipeline for superpoint-based 3D segmentation, breaking the partitioning bottleneck inherent in previous workflows. It achieves near-SOTA semantic segmentation accuracy in indoor, outdoor, and aerial domains with unprecedented inference speed and minimal VRAM, facilitating real-time, large-scale deployment of 3D perception. Its algorithmic and architectural contributions open pathways for scalable scene understanding and rapid research iteration in both academic and industry applications.