HARP-NeXt: 3D LiDAR Semantic Segmentation
- HARP-NeXt is a cutting-edge fusion network that delivers rapid, accurate semantic segmentation for 3D LiDAR point clouds in autonomous applications.
- It integrates a GPU-optimized preprocessing pipeline with a Conv-SE-NeXt block to drastically reduce latency and computational load.
- Utilizing a multi-scale range-point fusion backbone, HARP-NeXt achieves state-of-the-art performance on benchmarks like nuScenes and SemanticKITTI.
HARP-NeXt is a high-speed and accurate fusion network architecture designed for semantic segmentation of 3D LiDAR point clouds, specifically addressing the dual challenge of inference efficiency and segmentation accuracy. Developed to meet the operational requirements of autonomous vehicles and mobile robots, HARP-NeXt integrates a GPU-optimized pre-processing pipeline, a purpose-built Conv-SE-NeXt feature extraction block, and a multi-scale range-point fusion backbone. This architecture circumvents performance bottlenecks endemic to prior approaches and achieves state-of-the-art results on established benchmarks, such as nuScenes and SemanticKITTI, substantially advancing both research and deployment potential in perception systems for real-time robotics (Haidar et al., 8 Oct 2025).
1. Motivation and Problem Statement
LiDAR semantic segmentation assigns per-point class labels to 3D data, forming a foundational capability for perception stacks in autonomous systems. Accurate segmentation is necessary for downstream tasks like object detection, tracking, and scene understanding. Highly accurate modeling approaches such as point-based networks and sparse convolutions preserve 3D spatial geometry but incur substantial computation—often due to expensive neighbor-search and volumetric convolution operations—which impedes their suitability for real-time use, particularly on resource-constrained embedded hardware.
Projection-based approaches, which convert point clouds to 2D range images, offer faster inference but compromise geometric fidelity by flattening the data, degrading performance on fine-grained structures. Furthermore, most pipelines rely on computationally intensive pre-processing steps, such as spherical projection and k-nearest-neighbor search, that are conventionally performed on the CPU. These steps can dominate execution time, accounting for up to 83% of total runtime in some instances.
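A rough way to read the 83% figure is through Amdahl's law. The following is an illustrative bound, not a result from the paper: it assumes pre-processing takes a fraction $p = 0.83$ of total runtime and is accelerated by a hypothetical factor $s$ while the rest of the pipeline is unchanged.

$$
S(s) = \frac{1}{(1 - p) + p / s}, \qquad S(10) = \frac{1}{0.17 + 0.083} \approx 3.95, \qquad \lim_{s \to \infty} S(s) = \frac{1}{1 - p} = \frac{1}{0.17} \approx 5.9
$$

Under these assumptions, accelerating pre-processing alone could shorten end-to-end latency by up to roughly a factor of six, which is why it is a natural first target.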
HARP-NeXt specifically targets these bottlenecks: reducing pre-processing overhead, enabling efficient multi-modal data fusion at multiple abstraction levels, and deploying lightweight yet effective network modules for feature extraction. These design choices are made to meet stringent latency requirements in autonomous vehicle and robot contexts, without sacrificing segmentation accuracy.
2. GPU-Optimized Pre-Processing Pipeline
The HARP-NeXt methodology introduces a GPU-centric approach to pre-processing, minimizing reliance on the CPU and leveraging parallel computation early in the pipeline. Traditionally, projection—mapping 3D points to 2D range images—is performed on the CPU; HARP-NeXt reverses this paradigm by transferring raw point cloud data directly to the GPU, where the bulk of pre-processing is performed using CUDA kernels.
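The spherical projection mapping employed is given, in its standard form, as:

$$
\begin{pmatrix} u \\ v \end{pmatrix} =
\begin{pmatrix}
\frac{1}{2}\left[1 - \arctan(y, x)\,\pi^{-1}\right] W \\
\left[1 - \left(\arcsin(z\,r^{-1}) + f_{\text{down}}\right) f^{-1}\right] H
\end{pmatrix}
$$

where $(u, v)$ are the pixel coordinates of a point $(x, y, z)$, $(H, W)$ are the range image dimensions, $f = f_{\text{up}} + f_{\text{down}}$ is the vertical field-of-view (with $f_{\text{up}}$ and $f_{\text{down}}$ the magnitudes of its upper and lower bounds), and $r = \sqrt{x^2 + y^2 + z^2}$ is the Euclidean distance of each point. Executing this mapping on the GPU substantially reduces pre-processing latency, which is critical for embedded deployment. This approach avoids the data transfer overhead typically associated with CPU-GPU pipelines and enables real-time segmentation workflows on platforms like NVIDIA Jetson AGX Orin.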
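The projection arithmetic can be sketched per point in plain Python. This is a CPU reference for the math only, not the CUDA kernels used by HARP-NeXt. The function name `spherical_project` and the field-of-view parameters are assumptions made for illustration.

```python
import math

def spherical_project(point, height, width, fov_up_deg, fov_down_deg):
    """Map one 3D point (x, y, z) to a range-image pixel (row, col) and its range.

    fov_up_deg / fov_down_deg are the upper and lower vertical field-of-view
    bounds in degrees (hypothetical sensor parameters, e.g. +3 and -25).
    """
    x, y, z = point
    r = max(math.sqrt(x * x + y * y + z * z), 1e-8)  # Euclidean range, guarded against 0
    fov_down = abs(math.radians(fov_down_deg))
    fov = abs(math.radians(fov_up_deg)) + fov_down   # total vertical field of view

    yaw = math.atan2(y, x)                           # azimuth in [-pi, pi]
    pitch = math.asin(z / r)                         # elevation angle

    u = 0.5 * (1.0 - yaw / math.pi) * width          # continuous column coordinate
    v = (1.0 - (pitch + fov_down) / fov) * height    # continuous row coordinate

    # Clamp to valid pixel indices; points outside the field of view land on the border.
    col = min(width - 1, max(0, int(math.floor(u))))
    row = min(height - 1, max(0, int(math.floor(v))))
    return row, col, r
```

For a 64 × 2048 image with bounds of +3° and -25°, the point (1, 0, 0) directly ahead of the sensor maps to row 6, column 1024. On a GPU the same arithmetic would run for all points in parallel, one thread or tensor element per point.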
3. Conv-SE-NeXt Feature Extraction Block
The Conv-SE-NeXt block is a core architectural component designed for efficient and expressive feature extraction. It combines depth-wise separable convolution (decoupling spatial and channel aggregation), modern activation and normalization strategies, and channel-wise attention mechanisms:
- Depth-Wise Convolution: Captures local spatial dependencies with large kernels (e.g., 3×3, 5×5, 7×7), reducing the need for deep layer stacking.
- Point-Wise 1×1 Convolution: Aggregates channel information efficiently after the spatial filtering.
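- Activation and Normalization: Uses Batch Normalization and a fast Hardswish activation, realized as $\mathrm{Hardswish}(x) = x \cdot \mathrm{ReLU6}(x + 3) / 6$.
- Squeeze-and-Excitation (SE) Module: Applies global average pooling to obtain channel descriptors, $z_c = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j)$, for a feature map $x$ of spatial size $H \times W$. Channel weights are computed via learned 1×1 convolutions and a Hardsigmoid nonlinearity.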
- Skip Connections: Add the input tensor back to the output for improved gradient propagation.
This architectural design yields low parameter count and latency while delivering high discriminative capacity—a balance that is central to HARP-NeXt’s real-time applicability.
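The ingredients listed above can be sketched in plain Python on nested lists. This is a minimal illustration, not the authors' implementation. The helper names are hypothetical, the ReLU between the two 1×1 convolutions follows the original SE design and is an assumption here, and the parameter counts ignore biases.

```python
def relu6(x):
    return min(max(x, 0.0), 6.0)

def hardswish(x):
    # Hardswish(x) = x * ReLU6(x + 3) / 6
    return x * relu6(x + 3.0) / 6.0

def hardsigmoid(x):
    # Hardsigmoid(x) = ReLU6(x + 3) / 6
    return relu6(x + 3.0) / 6.0

def squeeze(feature_map):
    """Global average pooling: feature_map[c][i][j] -> one scalar per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]

def conv1x1(vec, weights, bias):
    """A 1x1 convolution on a 1x1 spatial map reduces to a dense layer."""
    return [sum(w * v for w, v in zip(w_row, vec)) + b for w_row, b in zip(weights, bias)]

def squeeze_excite(feature_map, w1, b1, w2, b2):
    """Rescale each channel by a weight in [0, 1] derived from its global average."""
    z = squeeze(feature_map)
    hidden = [max(h, 0.0) for h in conv1x1(z, w1, b1)]        # assumption: ReLU in between
    scale = [hardsigmoid(s) for s in conv1x1(hidden, w2, b2)]
    return [[[scale[c] * v for v in row] for row in ch] for c, ch in enumerate(feature_map)]

def standard_conv_params(c_in, c_out, k):
    return k * k * c_in * c_out

def separable_conv_params(c_in, c_out, k):
    # depth-wise k x k filter per input channel, then point-wise 1x1 mixing
    return k * k * c_in + c_in * c_out
```

With 64 input and output channels and a 7×7 kernel, `standard_conv_params` gives 200,704 weights while `separable_conv_params` gives 7,232, which illustrates why large depth-wise kernels stay affordable.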
4. Multi-Scale Range-Point Fusion Backbone
HARP-NeXt utilizes a backbone that fuses multi-scale features from both 2D range images (pixel-level) and 3D point clouds (point-level). The Feature Encoder extracts descriptors from raw and projected representations, aggregating projected points into pixels using average or max pooling. Hierarchical refinement is achieved via two mapping functions:
- The point-to-pixel mapping carries 3D point features into 2D pixel space.
- The pixel-to-point mapping reverses this, transferring refined pixel features back into 3D point attributes.
Each stage of the backbone employs the Conv-SE-NeXt block for feature extraction, followed by concatenation (with bilinear interpolation where necessary) and attention-weighted fusion governed by a learned transformation with trainable parameters. This residual-like fusion method preserves geometric detail from the 3D domain and contextual information from the range image across multiple abstraction levels.
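The two mappings between point space and pixel space can be sketched as a scatter-with-pooling step and a gather step. This is a minimal illustration under assumed data layouts (lists of feature vectors and `(row, col)` tuples), with hypothetical function names. It does not reproduce the learned fusion itself.

```python
from collections import defaultdict

def points_to_pixels(point_feats, pixel_idx, reduce="mean"):
    """Pool per-point feature vectors into the pixels they project to.

    point_feats: list of feature vectors, one per point
    pixel_idx:   list of (row, col) tuples, one per point
    Returns {(row, col): pooled feature vector}; pixels with no points are absent.
    """
    buckets = defaultdict(list)
    for feat, px in zip(point_feats, pixel_idx):
        buckets[px].append(feat)

    pooled = {}
    for px, feats in buckets.items():
        channels = list(zip(*feats))  # transpose: one tuple of values per channel
        if reduce == "mean":
            pooled[px] = [sum(c) / len(c) for c in channels]
        elif reduce == "max":
            pooled[px] = [max(c) for c in channels]
        else:
            raise ValueError("reduce must be 'mean' or 'max'")
    return pooled

def pixels_to_points(pixel_feats, pixel_idx):
    """Gather: each point reads back the feature vector of its own pixel."""
    return [list(pixel_feats[px]) for px in pixel_idx]
```

Points that share a pixel receive the same feature vector on the way back, so the point branch is what keeps them distinguishable. That is one reason to fuse both representations rather than rely on the range image alone.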
5. Quantitative Performance and Benchmark Results
Empirical evaluation on nuScenes and SemanticKITTI indicates that HARP-NeXt achieves a favorable speed-accuracy balance. On nuScenes, HARP-NeXt scores a mean Intersection-over-Union (mIoU) of 77.1% without test-time augmentation (TTA). This is comparable to the leading PTv3 method while running 24 times faster (total runtime down to 10 ms on an NVIDIA RTX 4090, with competitive figures on the Jetson AGX Orin).
On SemanticKITTI, HARP-NeXt reaches ~65.1% mIoU while keeping the parameter count low (a few million parameters) and the computational complexity moderate. The network does not employ ensemble models or additional training data, facilitating reproducible comparisons relative to published baselines.
| Benchmark | HARP-NeXt mIoU | Inference Speed | Relative to Top Baseline |
|---|---|---|---|
| nuScenes | 77.1% | 24× faster than PTv3 | Comparable mIoU to PTv3 |
| SemanticKITTI | 65.1% | Fast, low MACs | SOTA, low model cost |
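For reference, the mIoU metric reported above averages the per-class ratio TP / (TP + FP + FN). The sketch below is a generic implementation, not the official benchmark evaluation code. The function name, the `ignore_index` parameter, and the choice to skip classes that never occur are assumptions.

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean Intersection-over-Union over per-point class labels."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue          # unlabeled points do not count
        if p == t:
            tp[t] += 1
        else:
            fp[p] += 1        # predicted class p where it was not present
            fn[t] += 1        # missed the true class t
    ious = []
    for c in range(num_classes):
        denom = tp[c] + fp[c] + fn[c]
        if denom > 0:         # assumption: classes absent from both are skipped
            ious.append(tp[c] / denom)
    return sum(ious) / len(ious) if ious else 0.0
```

With predictions `[0, 0, 1, 1]` against labels `[0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12. Benchmark scores accumulate these counts over the whole evaluation split before averaging.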
6. Deployment and Practical Implications
The integrated design of HARP-NeXt, combining fast GPU-based pre-processing and a low-latency fusion backbone, enables real-time processing on embedded platforms. This efficiency is crucial for autonomous vehicles and mobile robots, where timely interpretation of LiDAR data is essential for safety-critical decision-making. The reduced memory footprint and fast execution make deployment feasible without specialized, high-performance hardware.
No reliance on test-time augmentations or ensembles further simplifies real-world integration and scalability. This suggests HARP-NeXt can be adopted broadly for embedded perception tasks with stringent latency budgets.
7. Code Accessibility and Reproducibility
The complete HARP-NeXt implementation is publicly available, hosted at:
https://github.com/SamirAbouHaidar/HARP-NeXt
The repository contains the necessary instructions for model training and evaluation. Supported environments include CUDA-enabled GPUs (desktop and embedded), and the code is compatible with standard deep learning frameworks, such as PyTorch. Training protocols are fully standardized for comparative evaluation; no additional training data or special procedures are required.
HARP-NeXt exemplifies a fusion architecture combining fast pre-processing, depth-efficient feature extraction, and multi-scale data fusion for LiDAR semantic segmentation. Its methodological innovations and benchmark results substantiate its applicability for real-time autonomous driving and robotics, representing a substantial advance in practical semantic segmentation for embedded systems (Haidar et al., 8 Oct 2025).