RangeViT for 3D LiDAR Segmentation
- RangeViT is a method that applies image-pretrained vision transformers to 2D range-projected LiDAR data for efficient 3D semantic segmentation.
- It integrates a convolutional stem, transformer encoder, and convolutional decoder with skip connections to merge local and global features.
- Empirical evaluations on nuScenes and SemanticKITTI demonstrate state-of-the-art mIoU among projection-based methods together with real-time inference.
RangeViT is a method for 3D semantic segmentation of outdoor LiDAR point clouds that adapts vision transformers (ViTs) to work within the range-projection framework. By mapping point clouds onto 2D range images and leveraging image-pretrained ViT architectures in conjunction with convolutional components, RangeViT achieves state-of-the-art accuracy among projection-based methods while retaining fast and efficient inference. The method is designed to combine the computational benefits of 2D projections with the representational power of large-scale vision transformers, yielding strong performance on datasets such as nuScenes and SemanticKITTI (Ando et al., 2023).
1. Range Projection and Data Encoding
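RangeViT operates on input point clouds $P \in \mathbb{R}^{N \times 4}$, where each point $(x, y, z, i)$ comprises three Cartesian coordinates and an intensity value. The method applies a spherical (range) projection to map each 3D point to 2D image coordinates $(u, v)$ for a fixed image size $H \times W$. The mapping is defined by:
- Range: $r = \sqrt{x^2 + y^2 + z^2}$
- Vertical FoV: $f_{\text{up}}$ (upper angle), $f_{\text{down}}$ (lower angle), $f = |f_{\text{up}}| + |f_{\text{down}}|$ (total FoV)
- Pixel assignment: $u = \frac{1}{2}\left[1 - \operatorname{atan2}(y, x)\,\pi^{-1}\right] W$, $\quad v = \left[1 - \left(\arcsin(z\,r^{-1}) + |f_{\text{down}}|\right) f^{-1}\right] H$
A 5-channel range image is assembled by recording $(r, x, y, z, i)$ at each pixel $(u, v)$ for the closest point, zero-filling empty pixels.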
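The projection can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `range_projection`, the default field-of-view angles, and the choice to resolve pixel collisions by sorting on range are all assumptions made for the example.

```python
import numpy as np

def range_projection(points, H=32, W=2048, fov_up_deg=10.0, fov_down_deg=-30.0):
    """Project an (N, 4) array of (x, y, z, intensity) to a (5, H, W) range image.

    The FoV defaults are placeholder values, not taken from the source.
    """
    x, y, z, intensity = points.T
    r = np.sqrt(x ** 2 + y ** 2 + z ** 2)
    r_safe = np.maximum(r, 1e-8)  # avoid division by zero for points at the origin

    fov_down = abs(np.radians(fov_down_deg))
    fov = abs(np.radians(fov_up_deg)) + fov_down

    yaw = np.arctan2(y, x)                              # azimuth in [-pi, pi]
    pitch = np.arcsin(np.clip(z / r_safe, -1.0, 1.0))   # elevation

    u = 0.5 * (1.0 - yaw / np.pi) * W
    v = (1.0 - (pitch + fov_down) / fov) * H
    u = np.clip(np.floor(u), 0, W - 1).astype(np.int64)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int64)

    # Keep only the closest point per pixel: sort by range, then take the
    # first occurrence of every flattened pixel index.
    order = np.argsort(r, kind="stable")
    flat = v[order] * W + u[order]
    _, first = np.unique(flat, return_index=True)
    keep = order[first]

    feats = np.stack([r, x, y, z, intensity], axis=0)   # (5, N)
    image = np.zeros((5, H, W), dtype=np.float32)       # empty pixels stay zero
    image[:, v[keep], u[keep]] = feats[:, keep]
    return image, u, v
```

The returned `u` and `v` arrays record which pixel each point landed in, which is useful later when per-pixel features have to be mapped back to points.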
2. Model Architecture
2.1 Convolutional Stem
The convolutional stem replaces the standard linear patch embedding with a deeper, nonlinear tokenization better suited to the idiosyncrasies of range images. Specifically, it uses four SalsaNext residual blocks (the “context module”) to produce an intermediate tensor $\mathbf{t}_c \in \mathbb{R}^{D_h \times H \times W}$, with two initial channel expansions (5→32 and 32→$D_h$). Average pooling reduces the spatial resolution, and a $1 \times 1$ convolution projects the features to a sequence of $M$ visual tokens of dimension $D$, $\mathbf{t}_v \in \mathbb{R}^{M \times D}$.
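The last two steps of the stem (pooling, then a pointwise projection to tokens) can be illustrated as follows. This is a sketch under assumptions: the residual blocks are omitted, the name `tokens_from_stem` and the weight matrix `proj` are hypothetical, the pooling is assumed to use non-overlapping patches, and the patch size used here is only illustrative.

```python
import numpy as np

def tokens_from_stem(stem_features, proj, patch=(2, 4)):
    """Turn a (D_h, H, W) stem feature map into an (M, D) token sequence.

    proj has shape (D, D_h) and plays the role of a bias-free 1x1 convolution.
    """
    d_h, height, width = stem_features.shape
    p_h, p_w = patch
    if height % p_h or width % p_w:
        raise ValueError("image size must be divisible by the patch size")

    # Average pooling over non-overlapping p_h x p_w patches.
    pooled = stem_features.reshape(
        d_h, height // p_h, p_h, width // p_w, p_w
    ).mean(axis=(2, 4))                                   # (D_h, H/p_h, W/p_w)

    # A 1x1 convolution is a linear map applied to the channel axis
    # independently at every spatial location.
    projected = np.einsum("dc,chw->dhw", proj, pooled)    # (D, H/p_h, W/p_w)

    # Flatten the grid row by row into M = (H/p_h) * (W/p_w) tokens.
    return projected.reshape(proj.shape[0], -1).T         # (M, D)
```

For a 4×8 feature map and 2×4 patches this yields 4 tokens, one per patch, each a projected average of the features inside that patch.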
2.2 ViT Encoder
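Visual tokens are stacked with a class token and added to learnable positional embeddings. The $L$ transformer blocks apply multi-head self-attention (MSA) and feed-forward networks (FFN) with LayerNorm (LN) and residual connections:
- Input: $\mathbf{z}_0 = [\mathbf{t}_{\text{cls}};\, \mathbf{t}_v] + \mathbf{E}_{\text{pos}}$
- Each block ($l = 1, \dots, L$):
$\mathbf{a}_l = \mathbf{z}_{l-1} + \mathrm{MSA}(\mathrm{LN}(\mathbf{z}_{l-1})), \qquad \mathbf{z}_l = \mathbf{a}_l + \mathrm{FFN}(\mathrm{LN}(\mathbf{a}_l))$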
The class token is discarded at the end of encoding.
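A compact NumPy sketch of the encoder may help make the data flow concrete. It assumes the standard pre-norm ViT block and simplifies it: biases and the learnable LayerNorm scale and shift are left out, GELU uses the tanh approximation, and the parameter dictionary keys (`w_qkv`, `w_out`, `w1`, `w2`) are hypothetical names.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def msa(x, w_qkv, w_out, heads):
    """Multi-head self-attention on an (n, d) token matrix."""
    n, d = x.shape
    d_head = d // heads
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)             # each (n, d)

    def split_heads(t):
        return t.reshape(n, heads, d_head).transpose(1, 0, 2)  # (heads, n, d_head)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))  # (heads, n, n)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)     # merge heads
    return out @ w_out

def ffn(x, w1, w2):
    h = x @ w1
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h ** 3)))
    return h @ w2

def vit_block(z, p, heads):
    a = z + msa(layer_norm(z), p["w_qkv"], p["w_out"], heads)  # attention + residual
    return a + ffn(layer_norm(a), p["w1"], p["w2"])             # FFN + residual

def encode(tokens, cls_token, pos_emb, blocks, heads):
    """tokens: (M, D); cls_token: (D,); pos_emb: (M + 1, D)."""
    z = np.concatenate([cls_token[None, :], tokens], axis=0) + pos_emb
    for p in blocks:
        z = vit_block(z, p, heads)
    return z[1:]  # drop the class token, keep the M visual tokens
```

Because every sub-layer is wrapped in a residual connection, a block with all-zero weights reduces to the identity, which is a convenient sanity check.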
2.3 Convolutional Decoder and Skip Connection
Output tokens are reshaped into a 2D feature grid, passed through a $1 \times 1$ convolution to expand channels, then upsampled to the original image size using PixelShuffle. The upsampled decoder outputs are concatenated with features from the convolutional stem (the intermediate tensor $\mathbf{t}_c$), followed by additional convolutional layers (3×3 and 1×1, BatchNorm, LeakyReLU) to recover dense spatial information.
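The upsampling path can be sketched as below. The rectangular `pixel_shuffle` is an assumption: it generalizes the usual square PixelShuffle channel layout to separate vertical and horizontal factors so that it can undo non-square patches. The names `decode_upsample` and `w_expand` are hypothetical, and the trailing 3×3 and 1×1 convolutions are omitted.

```python
import numpy as np

def pixel_shuffle(x, r_h, r_w):
    """Rearrange (C * r_h * r_w, h, w) into (C, h * r_h, w * r_w)."""
    c_in, h, w = x.shape
    c = c_in // (r_h * r_w)
    x = x.reshape(c, r_h, r_w, h, w)
    x = x.transpose(0, 3, 1, 4, 2)            # (c, h, r_h, w, r_w)
    return x.reshape(c, h * r_h, w * r_w)

def decode_upsample(tokens, w_expand, stem_features, patch):
    """tokens: (M, D); w_expand: (D_h * p_h * p_w, D); stem_features: (D_h, H, W)."""
    p_h, p_w = patch
    _, height, width = stem_features.shape
    grid_h, grid_w = height // p_h, width // p_w

    # Tokens back to a 2D grid, row by row.
    grid = tokens.T.reshape(tokens.shape[1], grid_h, grid_w)   # (D, H/p_h, W/p_w)

    # Bias-free 1x1 convolution that expands the channel dimension.
    expanded = np.einsum("ed,dhw->ehw", w_expand, grid)

    # PixelShuffle trades channels for spatial resolution.
    upsampled = pixel_shuffle(expanded, p_h, p_w)              # (D_h, H, W)

    # Skip connection: concatenate with the stem features along channels.
    return np.concatenate([upsampled, stem_features], axis=0)  # (2 * D_h, H, W)
```

PixelShuffle has no parameters of its own; the learnable part of the upsampling lives entirely in the preceding channel expansion.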
2.4 3D Refiner (KPConv)
Decoded 2D features are bilinearly sampled at the 2D projections of the original 3D points to yield per-point features. These are further refined using a KPConv layer, which updates features using local 3D neighborhoods. A final point-wise linear layer predicts semantic logits for each point.
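The 2D-to-3D step amounts to bilinear interpolation of the feature map at each point's continuous pixel location. The sketch below covers only that step and makes simplifying assumptions: coordinates are clamped at the image border (a real range image wraps around horizontally), integer coordinates are treated as pixel positions, and the KPConv refinement is not shown. The name `bilinear_sample` is hypothetical.

```python
import numpy as np

def bilinear_sample(feat, u, v):
    """Sample a (C, H, W) feature map at float pixel coordinates.

    u (column) and v (row) are arrays of shape (N,); returns (N, C).
    """
    _, height, width = feat.shape
    u = np.clip(np.asarray(u, dtype=np.float64), 0, width - 1)
    v = np.clip(np.asarray(v, dtype=np.float64), 0, height - 1)

    u0 = np.floor(u).astype(np.int64)
    v0 = np.floor(v).astype(np.int64)
    u1 = np.minimum(u0 + 1, width - 1)
    v1 = np.minimum(v0 + 1, height - 1)
    du, dv = u - u0, v - v0                    # fractional offsets in [0, 1)

    # Interpolate along the row direction first, then between the two rows.
    top = feat[:, v0, u0] * (1 - du) + feat[:, v0, u1] * du
    bottom = feat[:, v1, u0] * (1 - du) + feat[:, v1, u1] * du
    return (top * (1 - dv) + bottom * dv).T    # (N, C)
```

Sampling at continuous locations lets several points that fell into the same pixel during projection receive slightly different features, which a nearest-pixel lookup could not do.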
3. Principal Design Components
Three essential design choices underlie RangeViT's accuracy and transferability:
- Image-Pretrained ViT Backbones: Unmodified ViT encoder architectures enable initialization from models trained on large-scale image collections, such as ImageNet-21k (optionally followed by fine-tuning on Cityscapes) or self-supervised DINO. This yields consistent improvements (+2–3 pp mIoU) and faster convergence compared to random initialization.
- Convolutional Stem Instead of Linear Embedding: The convolutional stem injects spatial locality and nonlinearity, adapting range images to match the distribution of data used in ViT pretraining. Replacement with a linear embedding results in a ~4 pp mIoU drop.
- Convolutional Decoder with Skip Connection: A lightweight decoder and a single skip connection (from the convolutional stem to the feature decoder) efficiently merge fine-grained and semantic representations. Omission of the skip connection or of convolutional upsampling also incurs a ~4 pp mIoU penalty.
4. Training Regimen and Datasets
RangeViT is evaluated on two principal datasets:
- nuScenes: 32×2048 range images, 28,130 train / 6,019 val, 16 classes.
- SemanticKITTI: 64×2048 range images, 19,130 train / 4,071 val, 19 classes (test: sequences 11–21).
ViT-S/16 models ($L = 12$ layers, $D = 384$, 6 heads) are pretrained on ImageNet-21k and further fine-tuned with Segmenter on Cityscapes. The stem, decoder, and KPConv layer are randomly initialized. Optimization uses AdamW with weight decay, batch sizes of 32 (nuScenes) or 16 (SemanticKITTI), and a cosine annealing schedule. Data augmentations include random flips, 3D translations, random rotations (±5° roll/pitch/yaw), and random cropping. Inference uses sliding-window averaging.
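Sliding-window inference can be sketched as follows. The source does not give the window size, stride, or direction, so this example assumes windows that slide along the image width and averages the logits wherever windows overlap. `predict` stands in for the trained network and is a hypothetical callable.

```python
import numpy as np

def sliding_window_logits(image, predict, window, stride):
    """Average per-pixel logits over overlapping horizontal windows.

    image: (C, H, W); predict maps a (C, H, window) crop to (K, H, window) logits.
    """
    _, height, width = image.shape
    if window > width:
        raise ValueError("window must not exceed the image width")

    starts = list(range(0, width - window + 1, stride))
    if starts[-1] != width - window:
        starts.append(width - window)          # make sure the right edge is covered

    total = None
    count = np.zeros((1, 1, width))
    for s in starts:
        logits = predict(image[:, :, s:s + window])
        if total is None:
            total = np.zeros((logits.shape[0], height, width))
        total[:, :, s:s + window] += logits
        count[:, :, s:s + window] += 1.0       # how many windows saw each column
    return total / count
```

With a width of 8, a window of 4, and a stride of 3, the windows start at columns 0, 3, and 4, so every column is covered at least once.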
5. Empirical Performance
5.1 Quantitative Results
nuScenes (val set, mIoU %):
| Method | mIoU |
|---|---|
| RangeNet++ | 65.5 |
| SalsaNext | 72.2 |
| PolarNet | 71.0 |
| RangeViT-IN21k | 74.8 |
| RangeViT-CS | 75.2 |
SemanticKITTI (test set, mIoU %):
| Method | mIoU |
|---|---|
| RangeNet++ | 52.2 |
| SqueezeSegV3 | 55.9 |
| SalsaNext | 59.5 |
| KPRNet | 63.1 |
| Lite-HDSeg | 63.8 |
| RangeViT-CS | 64.0 |
RangeViT narrows the gap to top-performing voxel-based approaches (e.g., Cylinder3D: 76.1% nuScenes, 67.8% SemanticKITTI), while maintaining the computational speed characteristic of projection-based methods.
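For reference, the mIoU metric reported in these tables is the per-class intersection over union averaged across classes. The sketch below computes it from a confusion matrix for a single set of predictions. Benchmark protocols accumulate the confusion matrix over the whole evaluation set, and the handling of absent classes here (they are skipped) is an assumption of this example.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=None):
    """Return (mIoU, per-class IoU) for integer label arrays of equal size."""
    pred = np.asarray(pred, dtype=np.int64).ravel()
    target = np.asarray(target, dtype=np.int64).ravel()
    if ignore_index is not None:
        keep = target != ignore_index          # drop unlabeled points
        pred, target = pred[keep], target[keep]

    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(target * num_classes + pred, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)

    tp = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp   # TP + FP + FN

    iou = np.full(num_classes, np.nan)
    present = union > 0                        # classes that occur at all
    iou[present] = tp[present] / union[present]
    return float(np.nanmean(iou)), iou
```

For example, predictions `[0, 0, 1, 1]` against labels `[0, 1, 1, 1]` give per-class IoUs of 1/2 and 2/3.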
5.2 Ablation Studies and Analysis
Selected ablation results (nuScenes val, mIoU):
| Stem | Decoder | 3D Refiner | mIoU |
|---|---|---|---|
| Linear | Linear | — | 65.5 |
| Conv | Linear | — | 69.8 |
| Conv | UpConv | — | 73.8 |
| Conv | UpConv | KPConv | 74.6 |
Pretraining initialization (mIoU): Random 72.4%, DINO 73.3%, IN21k 74.8%, Cityscapes 75.2%.
Partial fine-tuning, which freezes the attention weights and trains only the FFN and LayerNorm parameters, achieves 75.5% vs 75.2% for full fine-tuning. Non-square patches outperform the standard square $16 \times 16$ patches (75.2% vs 68.5%).
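One simple way to express the partial fine-tuning setup is to split parameters by name. The sketch below is framework-free and purely illustrative: the parameter names follow a timm-style naming scheme, which is an assumption, and anything outside the attention sub-layers (FFN, LayerNorm, stem, decoder) is treated as trainable.

```python
def split_parameters(param_names, frozen_marker=".attn."):
    """Split parameter names into (trainable, frozen) lists.

    A parameter is frozen when its name contains frozen_marker.
    """
    trainable = [n for n in param_names if frozen_marker not in n]
    frozen = [n for n in param_names if frozen_marker in n]
    return trainable, frozen

# Hypothetical parameter names, used only to illustrate the split.
example_names = [
    "stem.conv1.weight",
    "encoder.blocks.0.norm1.weight",
    "encoder.blocks.0.attn.qkv.weight",
    "encoder.blocks.0.attn.proj.weight",
    "encoder.blocks.0.norm2.weight",
    "encoder.blocks.0.mlp.fc1.weight",
    "encoder.blocks.0.mlp.fc2.weight",
    "decoder.conv.weight",
]
trainable, frozen = split_parameters(example_names)
```

In a deep learning framework, the frozen list would then have gradient computation disabled and only the trainable list would be handed to the optimizer.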
6. Strengths, Limitations, and Research Outlook
Strengths:
- Efficient reuse of image-pretrained "foundation" vision transformers for 3D LiDAR data.
- Minimal architectural changes; no custom attention mechanisms are required.
- Real-time inference capability (approximately 25 ms per frame on an RTX 2080).
- State-of-the-art among 2D projection-based segmentation methods.
Known Limitations:
- Some loss of 3D local neighborhood structure due to projection is only partly recoverable by the post-hoc KPConv refiner.
- Performance on sparse classes (e.g., bicycles, traffic cones) remains challenging.
- The fixed projection and the mismatch between the inductive biases of image models and LiDAR data impose further limits.
Future Directions:
- Exploration of flexible tokenizers (e.g., FlexiViT, Perceiver IO) for end-to-end 3D tokenization.
- Development of alternatives to projection-based encoding and increased integration of multi-modal data (e.g., combining LiDAR with camera imagery).
RangeViT demonstrates that with a carefully architected convolutional stem, efficient decoder, and principled use of image-pretrained ViT backbones, projection-based 3D semantic segmentation can achieve accuracy that approaches more computationally intensive voxel-based methods, while preserving computational efficiency and architectural simplicity (Ando et al., 2023).