DAGLFNet: LiDAR Segmentation Framework
- DAGLFNet is a deep learning framework for efficient semantic segmentation of 3D LiDAR point clouds using novel global-local feature fusion mechanisms.
- Its architecture incorporates a multi-branch feature extraction module that combines standard, dilated, and edge-enhancing convolutions to capture local and extended spatial contexts.
- A depth-guided attention module further refines feature fusion, and the network achieves state-of-the-art mIoU on benchmarks such as SemanticKITTI and nuScenes while sustaining real-time performance.
DAGLFNet is a deep learning framework designed for efficient semantic segmentation of pseudo-image representations of 3D point clouds, notably those generated by LiDAR sensors in autonomous navigation and mapping contexts. Its architecture integrates global-local feature fusion, multi-branch convolutional extraction, and deep attention-guided feature fusion to address the challenge of extracting discriminative semantic information from unstructured raw point clouds, especially when using projection-based methods that can induce geometric distortions.
1. Motivation and Significance
Environmental perception for autonomous systems increasingly relies on LiDAR to capture dense spatial data. Traditional methods for LiDAR segmentation—such as voxelization or native point-wise processing—often yield a trade-off between computational overhead and the retention of high-resolution geometric and semantic features. Pseudo-image approaches, which project the point cloud onto a structured 2D grid (e.g., a range image), facilitate the use of 2D convolutional architectures and accelerate processing. Nonetheless, these methods tend to blur boundaries and lose key local information, particularly in regions with occlusions, sparsity, or complex topology. DAGLFNet is introduced to overcome these limitations by explicitly fusing global scene context with fine-grained local geometric features, employing modules that mitigate representation degradation during projection (Chen et al., 12 Oct 2025).
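To make the pseudo-image setting concrete, the following is a minimal NumPy sketch of the range-image (spherical) projection that such methods rely on; the 64×2048 resolution and the vertical field of view are assumed KITTI-like values, not settings taken from the paper.

```python
# Illustrative NumPy sketch of spherical (range-image) projection; the 64x2048
# resolution and the vertical field of view are assumed KITTI-like values.
import numpy as np

def range_projection(points, H=64, W=2048, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project an (N, 3) xyz point cloud onto an (H, W) range image."""
    fov_up = np.radians(fov_up_deg)
    fov_down = np.radians(fov_down_deg)
    fov = fov_up - fov_down                      # total vertical field of view

    depth = np.linalg.norm(points[:, :3], axis=1) + 1e-8
    yaw = np.arctan2(points[:, 1], points[:, 0])                  # horizontal angle
    pitch = np.arcsin(np.clip(points[:, 2] / depth, -1.0, 1.0))  # vertical angle

    # Normalize angles to [0, 1] and scale to pixel coordinates.
    u = np.clip(np.floor(0.5 * (1.0 - yaw / np.pi) * W), 0, W - 1).astype(int)
    v = np.clip(np.floor((1.0 - (pitch - fov_down) / fov) * H), 0, H - 1).astype(int)

    # Write farthest points first so nearer points overwrite them per pixel.
    order = np.argsort(depth)[::-1]
    image = np.full((H, W), -1.0, dtype=np.float32)
    image[v[order], u[order]] = depth[order]
    return image
```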
2. Architecture and Major Components
DAGLFNet comprises three principal modules, each targeting different aspects of feature extraction and fusion:
Global-Local Feature Fusion Encoding (GL-FFE)
This module processes the point cloud by segmenting points into groups (according to azimuth, laser-beam, or similar criteria). Within each group:
- Local geometric features are aggregated via pooling and encoded using an MLP.
- These are concatenated with point-level attributes (including offsets and depth), constructing enhanced representations.
- Simultaneously, global scene-level features are derived, stabilizing per-group encoding and making the local features more robust to projection-induced distortion (a minimal sketch follows this list).
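The PyTorch sketch below illustrates this per-group encoding idea; the grouping shape, feature dimensions, pooling choice, and MLP design are illustrative assumptions rather than the authors' implementation (scene-level global features can be derived analogously by pooling across all groups).

```python
# PyTorch sketch of per-group local feature encoding in the spirit of GL-FFE;
# shapes, dimensions, and the pooling/MLP design are assumptions.
import torch
import torch.nn as nn

class GroupLocalEncoder(nn.Module):
    def __init__(self, point_dim=4, feat_dim=32):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(point_dim, feat_dim), nn.ReLU())
        self.group_mlp = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
        # Fuses point features, pooled group features, and point attributes
        # (offset from the group centroid and depth).
        self.fuse_mlp = nn.Sequential(nn.Linear(feat_dim * 2 + 4, feat_dim), nn.ReLU())

    def forward(self, points):
        # points: (B, G, K, 4) = batch, groups, points per group, (x, y, z, intensity)
        xyz = points[..., :3]
        depth = xyz.norm(dim=-1, keepdim=True)                    # per-point range
        offset = xyz - xyz.mean(dim=2, keepdim=True)              # offset from group centroid

        point_feat = self.point_mlp(points)                       # per-point embedding
        group_feat = self.group_mlp(point_feat.max(dim=2).values)        # pooled group feature
        group_feat = group_feat.unsqueeze(2).expand_as(point_feat)       # broadcast to points

        enhanced = torch.cat([point_feat, group_feat, offset, depth], dim=-1)
        return self.fuse_mlp(enhanced)                            # (B, G, K, feat_dim)
```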
Multi-Branch Feature Extraction (MB-FE)
- The pseudo-image, resulting from projecting the encoded point cloud features, is input to several parallel convolutional branches:
- Standard convolution captures local context.
- Dilated convolution (dilation=2) encodes wider spatial context.
- An edge-oriented branch, in which an additional operation is followed by a convolution, emphasizes contour and edge features.
- The outputs of all branches are concatenated and fused via a convolution, and a residual connection adds back the original features, enabling effective propagation and refinement (a minimal sketch follows this list).
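The hedged PyTorch sketch below illustrates the multi-branch layout described above: a standard 3×3 branch, a dilated branch (dilation=2), an edge-oriented branch, concatenation, a 1×1 fusion convolution, and a residual connection. Kernel sizes and the specific edge operator (here, a difference from a local average) are assumptions for illustration.

```python
# PyTorch sketch of the multi-branch block; kernel sizes and the edge operator
# (difference from a local average) are illustrative assumptions.
import torch
import torch.nn as nn

class MultiBranchBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.standard = nn.Conv2d(channels, channels, 3, padding=1)              # local context
        self.dilated = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)   # wider context
        self.pool = nn.AvgPool2d(3, stride=1, padding=1)
        self.edge_conv = nn.Conv2d(channels, channels, 3, padding=1)             # edge/contour branch
        self.fuse = nn.Conv2d(channels * 3, channels, 1)                         # 1x1 fusion
        self.act = nn.ReLU()

    def forward(self, x):
        b1 = self.standard(x)
        b2 = self.dilated(x)
        b3 = self.edge_conv(x - self.pool(x))     # high-frequency (contour) emphasis
        fused = self.fuse(torch.cat([b1, b2, b3], dim=1))
        return self.act(fused + x)                # residual connection
```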
Feature Fusion via Deep Feature-guided Attention (FFDFA)
- To reconcile group-level global features and point-level local features after projection and back-projection steps, FFDFA employs an attention mechanism guided by depth information.
- Depth is transformed to generate the attention query, while the original point-level features and the reprojected group-level features serve as keys and values.
- The fusion weights obtained through attention maintain spatial coherence and sharpen discriminability, which is crucial for regions with overlapping semantic categories or projection artifacts (see the sketch below).
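The sketch below illustrates the depth-guided attention idea in FFDFA terms: the query is derived from depth, while the point-level and reprojected group-level features act as keys and values. Feature dimensions and the exact fusion rule are assumptions, not the paper's formulation.

```python
# PyTorch sketch of depth-guided attention fusion; dimensions and the exact
# fusion rule are assumptions, not the paper's formulation.
import torch
import torch.nn as nn

class DepthGuidedFusion(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        self.q_proj = nn.Linear(1, feat_dim)         # depth -> query
        self.k_proj = nn.Linear(feat_dim, feat_dim)  # features -> keys
        self.v_proj = nn.Linear(feat_dim, feat_dim)  # features -> values
        self.scale = feat_dim ** -0.5

    def forward(self, depth, point_feat, group_feat):
        # depth: (B, N, 1); point_feat, group_feat: (B, N, D)
        q = self.q_proj(depth)                                   # (B, N, D)
        feats = torch.stack([point_feat, group_feat], dim=2)     # (B, N, 2, D)
        k, v = self.k_proj(feats), self.v_proj(feats)
        # Per-point attention weights over the two feature sources.
        attn = torch.softmax((q.unsqueeze(2) * k).sum(-1) * self.scale, dim=-1)  # (B, N, 2)
        return (attn.unsqueeze(-1) * v).sum(dim=2)               # fused feature, (B, N, D)
```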
3. Methodology: Computational Pipeline
The critical stages in the DAGLFNet pipeline are:
- Global-Local Point Cloud Encoder: Receives the raw point cloud, partitions it into groups, encodes features at both the point and group level, and projects them onto a 2D pseudo-image grid.
- Image Feature Encoder: Applies MB-FE to the pseudo-image, extracting rich semantic and contour features across multiple receptive fields.
- Feature Update Module: Utilizes FFDFA to fuse point-level and reprojected group-level features through a depth-guided attention scheme. The formulation involves linear maps producing query (Q), key (K), and value (V) matrices, with attention-weighted fusion of the standard scaled dot-product form $\mathrm{softmax}(QK^{\top}/\sqrt{d_k})\,V$.
- Fusion Head: Concatenates and upscales feature maps from earlier stages (using bilinear interpolation and flattening), synthesizing final semantic scores.
The complete process establishes a mapping $f_{\theta}$ from the input point cloud to per-point label assignments, where $\theta$ denotes the trainable network parameters.
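As a rough illustration of the fusion head stage, the sketch below upsamples multi-stage feature maps with bilinear interpolation, concatenates them, and produces per-class scores that are then flattened; the channel counts and class count are placeholder assumptions.

```python
# PyTorch sketch of a fusion head: bilinear upsampling, concatenation, and a
# 1x1 classifier; channel counts and the class count are placeholder values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    def __init__(self, stage_channels=(32, 64, 128, 256), num_classes=20):
        super().__init__()
        self.classifier = nn.Conv2d(sum(stage_channels), num_classes, 1)

    def forward(self, stage_feats):
        # stage_feats: list of (B, C_i, H_i, W_i) maps from progressively coarser stages.
        target = stage_feats[0].shape[-2:]
        upsampled = [
            F.interpolate(f, size=target, mode="bilinear", align_corners=False)
            for f in stage_feats
        ]
        scores = self.classifier(torch.cat(upsampled, dim=1))    # (B, num_classes, H, W)
        # Flatten the pixel grid so per-pixel scores can be gathered back to points.
        return scores.flatten(2)                                  # (B, num_classes, H*W)
```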
4. Empirical Evaluation and Results
DAGLFNet was benchmarked on the SemanticKITTI and nuScenes datasets, focusing on autonomous driving scenarios:
- Network configuration: four stages, with per-stage depths and the input pseudo-image resolution configured per dataset (e.g., for SemanticKITTI).
- Training: AdamW optimizer with OneCycle learning rate scheduling (see the configuration sketch after this list).
- Performance metrics:
- SemanticKITTI: 69.83% mean Intersection over Union (mIoU).
- nuScenes: 78.65% mIoU (validation set).
- Runtime: Average inference time of 45.4 ms/scan (high-end GPU).
- The network consistently outperforms a variety of state-of-the-art approaches by reducing the frequency of misclassifications in challenging scenarios (e.g., blurred boundaries, distant objects, occlusions).
- Ablation studies confirm the incremental value of each component, with additive improvements in mIoU as GL-FFE, MB-FE, and FFDFA modules are integrated.
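For reference, a minimal PyTorch training-loop sketch matching the reported optimizer and schedule (AdamW with OneCycle) is given below; the stand-in model, dummy data, learning rate, and schedule lengths are placeholders, not values from the paper.

```python
# Sketch of the reported optimizer/schedule combination (AdamW + OneCycle);
# the model, data, learning rate, and schedule lengths are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Conv2d(5, 20, 3, padding=1)        # stand-in for the segmentation network
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3, weight_decay=1e-2)
epochs, steps_per_epoch = 2, 10               # tiny placeholder schedule
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=2e-3, epochs=epochs, steps_per_epoch=steps_per_epoch
)

for _ in range(epochs * steps_per_epoch):
    x = torch.randn(2, 5, 64, 512)            # dummy pseudo-image batch
    target = torch.randint(0, 20, (2, 64, 512))
    loss = F.cross_entropy(model(x), target)
    loss.backward()
    optimizer.step()
    scheduler.step()                          # OneCycle is stepped per iteration
    optimizer.zero_grad()
```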
5. Significance and Practical Applications
The design of DAGLFNet supports real-time semantic segmentation for LiDAR-based perception systems:
- The integration of global and local context addresses key deficiencies in pseudo-image approaches, strengthening feature discriminability even under geometric distortions and sparse sampling.
- Its systematic architecture enables deployment in latency-sensitive environments such as autonomous vehicles, robotics, and large-scale mapping.
- Empirical performance demonstrates robust semantic segmentation, balancing speed and accuracy, with clear practical implications for safety-critical navigation tasks.
- A plausible implication is enhanced perception capabilities in dynamic or complex scenes, where conventional methods may falter.
6. Limitations and Directions for Future Research
DAGLFNet exhibits strong baseline performance, yet several areas remain open for further study:
- Sparse or heavily occluded regions, as well as semantically similar objects (e.g., terrain vs. sidewalk), still present misclassification challenges.
- Future research may refine the depth-guided attention mechanism or introduce modality fusion to further enhance geometric and semantic resolution.
- The network’s adaptability to highly dynamic environments or the integration of multimodal sensor data could be explored for increased robustness and generalization.
7. Comparative Context
Relative to prior pseudo-image and voxelization-based approaches, DAGLFNet advances the state of the art by explicitly addressing feature degradation during projection and by adopting modules that synergistically combine semantic, geometric, and contextual information. Its ablation-validated components—global-local fusion, multi-branch extraction, attention-guided fusion—are critical for bridging efficiency and discriminative segmentation performance in time-sensitive LiDAR applications (Chen et al., 12 Oct 2025).