KPConv: Flexible and Deformable Convolution for Point Clouds (1904.08889v2)
Abstract: We present Kernel Point Convolution (KPConv), a new design of point convolution, i.e. that operates on point clouds without any intermediate representation. The convolution weights of KPConv are located in Euclidean space by kernel points, and applied to the input points close to them. Its capacity to use any number of kernel points gives KPConv more flexibility than fixed grid convolutions. Furthermore, these locations are continuous in space and can be learned by the network. Therefore, KPConv can be extended to deformable convolutions that learn to adapt kernel points to local geometry. Thanks to a regular subsampling strategy, KPConv is also efficient and robust to varying densities. Whether they use deformable KPConv for complex tasks, or rigid KPConv for simpler tasks, our networks outperform state-of-the-art classification and segmentation approaches on several datasets. We also offer ablation studies and visualizations to provide understanding of what has been learned by KPConv and to validate the descriptive power of deformable KPConv.
Summary
- The paper introduces KPConv, which uses learnable kernel points to perform convolutions directly on 3D point clouds without intermediate grid representations.
- It presents both rigid and deformable versions, where deformable kernels adapt locally via learned offsets to capture complex geometries.
- Experimental results demonstrate competitive state-of-the-art performance in tasks like classification and segmentation on diverse 3D datasets.
This paper introduces Kernel Point Convolution (KPConv), a novel convolution operator designed specifically for 3D point clouds that operates directly on point data without requiring an intermediate grid representation. The core idea is to define convolution kernels using a set of "kernel points" whose weights are learned by the network. These kernel points are located in Euclidean space and influence input points in their vicinity.
Core Concepts of KPConv
- Kernel Point Convolution: The convolution at a point x is defined as:
$$(\mathcal{F} * g)(x) = \sum_{x_i \in \mathcal{N}_x} g(x_i - x) f_i$$

where $f_i$ are the features of input points $x_i$ within a radius neighborhood $\mathcal{N}_x$ of $x$. The kernel function $g$, evaluated at a relative position $y_i = x_i - x$, is a sum of influences from $K$ kernel points $x_k$:

$$g(y_i) = \sum_{k < K} h(y_i, x_k)\, W_k$$

Here, $W_k$ are learnable weight matrices (mapping $D_{in}$ to $D_{out}$ features) associated with each kernel point $x_k$. The correlation function $h(y_i, x_k)$ determines the influence of kernel point $x_k$ at position $y_i$. A linear correlation is used:

$$h(y_i, x_k) = \max\left(0,\; 1 - \frac{\|y_i - x_k\|}{\sigma}\right)$$

where $\sigma$ is the influence distance of the kernel points. This is simpler than a Gaussian correlation and aids gradient backpropagation, especially for deformable kernels. (A minimal code sketch of this operation follows at the end of this list.)
- Neighborhoods: Radius neighborhoods are preferred over k-nearest neighbors (KNN) for robustness to varying point densities.
- Flexibility: The number of kernel points $K$ is not fixed, allowing for varying kernel complexity.
- Rigid KPConv: For the rigid version, kernel point positions xk are fixed. They are initialized by solving an optimization problem to distribute them regularly within a sphere (repulsive forces between points, attractive force to the sphere center, one point fixed at the center). These points are then rescaled to an average radius of 1.5σ to ensure good spatial coverage and slight overlap.
- Deformable KPConv: To allow the kernel to adapt to local geometry, the positions of kernel points can be learned. Instead of learning a single global set of deformed positions, the network generates $K$ shifts $\Delta_k(x)$ for each convolution location $x$. The deformable kernel function becomes:

$$g_{deform}(y_i, \Delta(x)) = \sum_{k < K} h\big(y_i,\; x_k + \Delta_k(x)\big)\, W_k$$

The offsets $\Delta_k(x)$ are predicted by a separate, preceding rigid KPConv layer that maps the input features to $3K$ values (one 3D shift per kernel point).
- Regularization for Deformable Kernels: A key challenge is "lost" kernel points: points that are shifted away from the input data receive null gradients. Two regularization losses are introduced (see the second sketch at the end of this list):
  - Fitting Loss ($L_{fit}$): penalizes the distance between each deformed kernel point and its closest input neighbor, encouraging kernel points to stay close to the data:

$$L_{fit}(x) = \sum_{k < K} \min_{y_i} \left( \frac{\|y_i - (x_k + \Delta_k(x))\|}{\sigma} \right)^2$$

  - Repulsive Loss ($L_{rep}$): penalizes overlap between deformed kernel points, preventing them from collapsing together:

$$L_{rep}(x) = \sum_{k < K} \sum_{l \neq k} h\big(x_k + \Delta_k(x),\; x_l + \Delta_l(x)\big)^2$$

The total regularization loss $L_{reg} = \sum_x \big(L_{fit}(x) + L_{rep}(x)\big)$ is added to the main task loss.
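To make the operator concrete, the following is a minimal NumPy sketch of rigid KPConv at a single query point, written directly from the equations above. It is illustrative only; the function name, argument names, and shapes are ours, not from the authors' TensorFlow release.

```python
import numpy as np

def kpconv_at_point(x, neighbors, feats, kernel_pts, W, sigma):
    """Rigid KPConv (F * g)(x) at one query point x.

    neighbors  : (n, 3) input points x_i in the radius neighborhood N_x
    feats      : (n, D_in) features f_i of those points
    kernel_pts : (K, 3) kernel point positions x_k (fixed in the rigid case)
    W          : (K, D_in, D_out) learnable weight matrices W_k
    sigma      : influence distance of the kernel points
    """
    y = neighbors - x                                   # relative positions y_i
    # Linear correlation h(y_i, x_k) = max(0, 1 - ||y_i - x_k|| / sigma)
    d = np.linalg.norm(y[:, None, :] - kernel_pts[None, :, :], axis=-1)  # (n, K)
    h = np.maximum(0.0, 1.0 - d / sigma)                # (n, K)
    # g(y_i) = sum_k h(y_i, x_k) W_k, then (F * g)(x) = sum_i g(y_i) f_i
    g = np.einsum('nk,kio->nio', h, W)                  # (n, D_in, D_out)
    return np.einsum('nio,ni->o', g, feats)             # (D_out,)
```

For deformable KPConv, the predicted shifts $\Delta_k(x)$ would simply be added to `kernel_pts` before the distance computation.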
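The two regularization terms can be sketched in the same style; again a hypothetical helper under the same assumptions, not the released code.

```python
import numpy as np

def deform_regularization(y, kernel_pts, offsets, sigma):
    """Fitting + repulsive losses at one convolution location.

    y          : (n, 3) relative input positions y_i
    kernel_pts : (K, 3) rigid kernel point positions x_k
    offsets    : (K, 3) predicted shifts Delta_k(x)
    """
    deformed = kernel_pts + offsets                      # (K, 3)
    # Fitting loss: squared scaled distance to the closest input point
    d = np.linalg.norm(y[:, None, :] - deformed[None, :, :], axis=-1) / sigma  # (n, K)
    l_fit = np.sum(np.min(d, axis=0) ** 2)
    # Repulsive loss: squared correlation between distinct kernel point pairs
    dk = np.linalg.norm(deformed[:, None, :] - deformed[None, :, :], axis=-1) / sigma
    h = np.maximum(0.0, 1.0 - dk)
    np.fill_diagonal(h, 0.0)                             # drop the l == k terms
    l_rep = np.sum(h ** 2)
    return l_fit + l_rep
```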
Implementation and Network Architecture
- Subsampling: Grid subsampling is used to control point density at each layer. The support points of a layer are the barycenters of the input points contained in each non-empty grid cell, which keeps the subsampling spatially consistent (see the sketch after this list).
- Pooling Layer: To create hierarchical architectures, the grid cell size is doubled at pooling layers. Features are pooled using "strided KPConv" (KPConv followed by subsampling), analogous to strided convolutions in 2D CNNs.
- KPConv Layer:
- Inputs: Points P, features F, neighborhood indices N.
- Neighborhood matrix N has a fixed max size (nmax), with "shadow" neighbors (unused elements for smaller neighborhoods) ignored during computation.
- Network Parameters:
- Cell size $dl_j$ for layer $j$.
- Kernel influence distance $\sigma_j = \Sigma \times dl_j$.
- Convolution radius $r_j$: $2.5\,\sigma_j$ for rigid KPConv, $\rho \times dl_j$ for deformable.
- Default parameters: $K = 15$ kernel points, $\Sigma = 1.0$, $\rho = 5.0$; $dl_0$ is dataset-dependent.
- Architectures:
- KP-CNN (Classification): A 5-layer CNN. Each layer has two ResNet-like bottleneck blocks (KPConv, Batch Norm, Leaky ReLU). Global average pooling follows the last layer, then fully connected and softmax layers. Deformable kernels are typically used in later layers.
- KP-FCNN (Segmentation): A fully convolutional network with an encoder (similar to KP-CNN) and a decoder using nearest upsampling and skip connections. Unary convolutions (1x1 equivalent) process concatenated features from skip links and upsampled features.
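Returning to the subsampling bullet above, here is a minimal NumPy sketch of grid subsampling by cell barycenters (names are ours, not from the released code):

```python
import numpy as np

def grid_subsample(points, cell_size):
    """Replace all points falling in the same grid cell by their barycenter.

    points    : (N, 3) input point cloud
    cell_size : grid cell edge length dl_j for the current layer
    """
    cells = np.floor(points / cell_size).astype(np.int64)    # (N, 3) cell indices
    _, inverse, counts = np.unique(cells, axis=0,
                                   return_inverse=True, return_counts=True)
    barycenters = np.zeros((len(counts), 3))
    np.add.at(barycenters, inverse, points)                  # sum points per cell
    return barycenters / counts[:, None]                     # average per cell
```

Doubling `cell_size` from one pooling layer to the next halves the resolution, which is what makes strided KPConv analogous to a strided 2D convolution.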
KPConv Block (ResNet-like):
```
Input Features (D_in) -------------------------------.
        |                                            |
KPConv (D_in -> D_out/2, or D_out/4 for bottleneck)  |
Batch Norm                                           |  shortcut connection
Leaky ReLU                                           |  (possibly with 1x1 conv,
        |                                            |  or max pool if strided /
KPConv (D_out/2 -> D_out/2)                          |  dims change)
Batch Norm                                           |
Leaky ReLU                                           |
        |                                            |
KPConv (D_out/2 -> D_out, or D_out -> 2D if strided) |
Batch Norm                                           |
        |                                            |
       Add <-----------------------------------------'
        |
Leaky ReLU
        |
Output Features (D_out or 2D)
```
Practical Implementation Details
Variable Batch Size: Due to varying point cloud sizes, batches are formed by accumulating point clouds until a total point limit is reached. This ensures memory constraints are met while maximizing GPU utilization.
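A minimal sketch of this accumulation strategy follows; the point budget and names are illustrative, not from the paper.

```python
def accumulate_batches(clouds, point_limit=65536):
    """Yield batches of whole point clouds whose total size stays under a budget.

    clouds      : iterable of (N_i, 3) arrays of varying sizes
    point_limit : maximum total number of points per batch (illustrative value)
    """
    batch, total = [], 0
    for cloud in clouds:
        if batch and total + len(cloud) > point_limit:
            yield batch
            batch, total = [], 0
        batch.append(cloud)
        total += len(cloud)
    if batch:
        yield batch
```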
Input Features: For classification on ModelNet40, a constant feature of 1 is assigned to each input point. For segmentation, point coordinates (x,y,z) are added as features, along with the constant 1. If color is available (e.g., scene segmentation), RGB values are used as features, still retaining the constant 1 feature.
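Assembled as a per-point feature matrix, this convention could look like the following sketch (ours, not the released code):

```python
import numpy as np

def build_input_features(points, colors=None, classification=False):
    """Build per-point input features following the conventions above."""
    ones = np.ones((len(points), 1))            # constant 1 feature, always kept
    if classification:
        return ones                             # e.g. ModelNet40: constant only
    if colors is not None:
        return np.hstack([ones, colors])        # scene segmentation: 1 + RGB
    return np.hstack([ones, points])            # segmentation: 1 + (x, y, z)
```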
Training: An SGD optimizer with momentum is used, and the learning rate decays exponentially. For deformable KPConv, the learning rate applied to the offset-generating KPConv layer is 0.1 times the main network's learning rate.
Scene Segmentation Pipeline: Large scenes are segmented by processing smaller spherical subclouds. At training, spheres are randomly picked. At testing, spheres are picked regularly to cover the scene, and predictions for each point (potentially seen by multiple spheres) are averaged (voting scheme).
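The test-time voting scheme can be sketched as follows; the names and the exact averaging are our illustration of the idea.

```python
import numpy as np

def vote_predictions(num_points, sphere_results, num_classes):
    """Average per-point class scores over all spheres that saw each point.

    sphere_results : list of (indices, scores) pairs; indices maps a sphere's
                     points back into the full cloud, and scores has shape
                     (len(indices), num_classes)
    """
    votes = np.zeros((num_points, num_classes))
    counts = np.zeros(num_points)
    for indices, scores in sphere_results:
        votes[indices] += scores
        counts[indices] += 1
    seen = counts > 0
    votes[seen] /= counts[seen][:, None]
    return votes.argmax(axis=1)                  # final label per point
```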
Results and Applications
ModelNet40 Classification: Rigid KPConv (92.9% overall accuracy) slightly outperforms deformable KPConv, suggesting that simpler tasks may not benefit from the added complexity of deformable kernels.
ShapeNetPart Part Segmentation: Deformable KPConv performs better (86.4% mIoU) than rigid (86.2% mIoU), indicating its utility in more complex geometric tasks.
3D Scene Segmentation (ScanNet, S3DIS, Semantic3D, Paris-Lille-3D):
- KPConv (both versions) achieves state-of-the-art or competitive results across these diverse datasets.
- Deformable KPConv tends to perform better on large, diverse datasets like S3DIS (67.1% mIoU) and Paris-Lille-3D (75.9% mIoU).
- Rigid KPConv can be better on datasets with less object instance diversity or simpler geometry, like Semantic3D (74.6% mIoU for rigid vs 73.1% for deformable).
- On ScanNet, rigid KPConv achieved 68.6% mIoU versus 68.4% for deformable, although validation studies favored the deformable version.
Key Advantages and Insights
- Flexibility: The number of kernel points can be chosen.
- Adaptability (Deformable KPConv): Kernels can learn to adapt their shape to local geometry, improving performance on complex tasks. The Effective Receptive Field (ERF) visualization shows deformable kernels adapt to object size and shape.
- Robustness: Radius neighborhoods and grid subsampling handle varying point densities effectively.
- Efficiency: Despite the flexibility, the method is computationally manageable, and the subsampling strategy helps control costs.
- Descriptive Power: Deformable KPConv is more robust to a lower number of kernel points, indicating greater descriptive power per kernel point. Ablation studies on ScanNet showed deformable KPConv losing only 1.5% mIoU when restricted to 4 kernel points, versus 3.5% for rigid KPConv.
- Learned Features: Visualizations show that KPConv learns hierarchical features, from simple geometric primitives (planes, lines, corners) in early layers to more complex shapes in later layers.
Potential Limitations and Considerations
- Computational Cost: While efficient for a point-based method, it can be more demanding than voxel-based methods with highly optimized sparse convolutions, especially with many kernel points or large neighborhoods.
- Parameter Tuning: Parameters like K, Σ, ρ, and dl0 need to be chosen, potentially through cross-validation, which can be time-consuming.
- Deformable Kernel Complexity: While powerful, deformable kernels add more parameters and complexity, which might lead to overfitting on simpler or smaller datasets if not regularized properly or if the model is too deep. The regularization loss factors are also hyperparameters.
Real-World Applications
KPConv is suitable for tasks requiring detailed understanding of 3D point cloud geometry:
- Autonomous Driving: Semantic segmentation of LiDAR scans for identifying roads, vehicles, pedestrians, etc. Deformable kernels could adapt to varied object shapes.
- Robotics: Object recognition and pose estimation, scene understanding for navigation and manipulation.
- Augmented/Virtual Reality: Real-time scene reconstruction and semantic understanding.
- 3D Modeling and Design: CAD model classification, part segmentation, and shape analysis.
- Geospatial Analysis: Classification and segmentation of large-scale aerial or terrestrial LiDAR data (e.g., buildings, vegetation, ground).
The provided code (TensorFlow) allows practitioners to implement and extend KPConv for these and other applications. The design principles (kernel points, deformable mechanism, subsampling) offer a strong foundation for future research in point cloud processing.
Related Papers
- PointConv: Deep Convolutional Networks on 3D Point Clouds (2018)
- Adaptive Graph Convolution for Point Cloud Analysis (2021)
- MKConv: Multidimensional Feature Representation for Point Cloud Analysis (2021)
- Hausdorff Point Convolution with Geometric Priors (2020)
- KPConvX: Modernizing Kernel Point Convolution with Kernel Attention (2024)