
PointConv: 3D Point Cloud Convolutions

Updated 10 June 2026
  • PointConv is a convolution operator for 3D point clouds that uses a Monte Carlo approximation of continuous convolution with density correction to handle irregular sampling.
  • It leverages MLP-based weight functions and polynomial kernels to enable translation and permutation invariance while ensuring computational efficiency.
  • PointConv has been applied to classification and segmentation, with reported state-of-the-art results on benchmarks such as ModelNet40, ShapeNet, and ScanNet.

PointConv is a convolutional operator and neural network architecture for learning on 3D point clouds, enabling fully convolutional deep networks on unordered, irregularly sampled Euclidean point sets. Unlike grid-based CNNs, PointConv formulates convolution as a Monte Carlo approximation of continuous convolution on $\mathbb{R}^3$ by learning weight and density functions over local neighborhoods. The operator supports translation and permutation invariance, corrects for nonuniform spatial sampling, and admits a memory-efficient implementation that scales to deep networks. PointConv and its subsequent extensions, including polynomial-based weights and viewpoint-invariant features, report state-of-the-art results in classification and segmentation across 3D vision benchmarks.

1. Mathematical Formulation of PointConv

Let $F: \mathbb{R}^3 \to \mathbb{R}^{C_{\text{in}}}$ be input features and $W: \mathbb{R}^3 \to \mathbb{R}^{C_{\text{out}} \times C_{\text{in}}}$ a learnable convolution kernel. Standard continuous convolution is

$$(F*W)(p) = \int_{\tau \in \mathbb{R}^3} W(\tau) F(p + \tau) \, d\tau.$$

For a finite point cloud $\{p_j\} \subset \mathbb{R}^3$, the operator is discretized by a Monte Carlo sum over local neighborhoods $\mathcal{N}(i)$, with density-corrected kernel weights:

$$\mathrm{PointConv}\left(F, p_i\right) = \sum_{p_j \in \mathcal{N}(i)} \frac{1}{D(p_j)} W(p_i - p_j) F(p_j),$$

where $D(p_j)$ is the estimated sampling density at $p_j$. Because $W$ is learned, writing its argument as $p_i - p_j$ rather than $\tau = p_j - p_i$ only changes the sign convention of the offset.

The weight function $W$ is parameterized by a multi-layer perceptron (MLP) that maps relative coordinates $p_i - p_j$ to kernel weights, ensuring translation and permutation invariance. The density $D(p_j)$ is estimated by kernel density estimation (KDE) using a Gaussian or similar kernel. Optionally, the inverse density $1/D(p_j)$ is itself modulated by a 1D MLP for learnable adaptive compensation (Wu et al., 2018).
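
The density-corrected sum can be sketched for a single neighborhood in NumPy. This is a minimal illustration, not the authors' implementation: the function names, the two-layer ReLU MLP, the bandwidth value, and the nearest-neighbor selection are all assumptions made for the example, and the KDE omits its normalizing constant.

```python
import numpy as np

def gaussian_kde_density(points, bandwidth=0.5):
    """Gaussian KDE at every point (normalizing constant omitted; assumed bandwidth)."""
    diff = points[:, None, :] - points[None, :, :]          # (N, N, 3)
    sq_dist = np.sum(diff ** 2, axis=-1)                    # (N, N)
    return np.mean(np.exp(-sq_dist / (2.0 * bandwidth ** 2)), axis=1)  # (N,)

def weight_mlp(offsets, params, c_out, c_in):
    """Hypothetical two-layer MLP mapping 3D offsets to (C_out, C_in) kernels."""
    w1, b1, w2, b2 = params
    hidden = np.maximum(offsets @ w1 + b1, 0.0)             # ReLU, (K, H)
    return (hidden @ w2 + b2).reshape(-1, c_out, c_in)      # (K, C_out, C_in)

def pointconv_at(center, neighbors, feats, density, params, c_out):
    """Monte Carlo sum over neighbors of 1/D(p_j) * W(p_i - p_j) @ F(p_j)."""
    c_in = feats.shape[1]
    kernels = weight_mlp(center - neighbors, params, c_out, c_in)
    scaled = feats / density[:, None]                       # density correction
    return np.einsum("koc,kc->o", kernels, scaled)          # (C_out,)

# Toy usage with random data (all sizes are illustrative).
rng = np.random.default_rng(0)
c_in, c_out, hidden_dim = 4, 5, 8
points = rng.normal(size=(64, 3))
feats = rng.normal(size=(64, c_in))
params = (rng.normal(size=(3, hidden_dim)), np.zeros(hidden_dim),
          rng.normal(size=(hidden_dim, c_out * c_in)), np.zeros(c_out * c_in))
density = gaussian_kde_density(points)
# Indices of the 16 nearest neighbors of point 0 (including itself).
idx = np.argsort(np.sum((points - points[0]) ** 2, axis=1))[:16]
out = pointconv_at(points[0], points[idx], feats[idx], density[idx], params, c_out)
```

Dividing the features by the density before the weighted sum keeps densely sampled regions from dominating the output, which is the role of the $1/D(p_j)$ factor.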

2. Efficient Computation and Memory Optimization

A naive implementation requires memory proportional to $B \times N \times K \times C_{\text{in}} \times C_{\text{out}}$ for batch size $B$, $N$ points, $K$ neighbors, input and output channels $C_{\text{in}}$, $C_{\text{out}}$. The key insight is that the last layer of the MLP weight network is linear. For each neighbor $p_j$, let the MLP intermediary be a $C_{\text{mid}}$-vector $m_j$, with shared linear weights $H \in \mathbb{R}^{C_{\text{mid}} \times (C_{\text{in}} C_{\text{out}})}$:

$$W(p_i - p_j) = \operatorname{reshape}\left(H^\top m_j\right) \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}}}.$$

Reordering sums allows computation by forming $\sum_{p_j \in \mathcal{N}(i)} \frac{1}{D(p_j)} \, m_j F(p_j)^\top \in \mathbb{R}^{C_{\text{mid}} \times C_{\text{in}}}$, followed by a $1 \times 1$ convolution with $H$. This lowers memory use to roughly $B \times N \times C_{\text{mid}} \times C_{\text{in}}$ for the aggregated tensor and enables deep, wide PointConv architectures (Wu et al., 2018).
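
The reordering can be checked numerically for one neighborhood. The sketch below uses assumed names (`m` for the last-layer MLP inputs, `H` for the shared linear weights stored as a 3D array) and random values; it only demonstrates that aggregating over neighbors first gives the same result as materializing a kernel per neighbor.

```python
import numpy as np

rng = np.random.default_rng(0)
K, C_in, C_mid, C_out = 16, 4, 8, 5           # illustrative sizes

m = rng.normal(size=(K, C_mid))               # input to the last MLP layer, per neighbor
H = rng.normal(size=(C_mid, C_in, C_out))     # shared weights of the last linear layer
feats = rng.normal(size=(K, C_in))            # F(p_j)
inv_density = rng.uniform(0.5, 2.0, size=K)   # 1 / D(p_j)

# Naive: build one (C_in, C_out) kernel per neighbor, then sum over neighbors.
kernels = np.einsum("km,mio->kio", m, H)                        # (K, C_in, C_out)
naive = np.einsum("k,kio,ki->o", inv_density, kernels, feats)   # (C_out,)

# Reordered: aggregate over neighbors first, then apply H once.
agg = np.einsum("k,km,ki->mi", inv_density, m, feats)           # (C_mid, C_in)
efficient = np.einsum("mi,mio->o", agg, H)                      # (C_out,)

assert np.allclose(naive, efficient)
```

The naive path stores a `(K, C_in, C_out)` tensor per point, while the reordered path stores only `(C_mid, C_in)`, which is where the memory saving comes from.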

3. Network Architecture and Applications

PointConv architectures use PointNet++-style encoder-decoder designs, replacing local PointNet aggregations with PointConv operators.

Typical architectures:

  • Classification (ModelNet40):
    • Input: $N$ points in $\mathbb{R}^3$
    • Encoder: SetAbstraction levels with farthest point sampling and $K$ neighbors per sampled point, each followed by a PointConv MLP
    • Global max-pooling, fully connected (FC) layers to softmax over 40 classes
  • Part Segmentation (ShapeNet):
    • Encoder/Decoder: Hierarchical subsampling and PointConv MLPs, PointDeconv applied for feature propagation, skip connections for refinement
  • Scene Segmentation (ScanNet):
    • Deep encoder-decoder with PointConv and PointDeconv, grid subsampling, multi-stage skip fusion
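
The sampling and grouping step used by the encoder levels above can be sketched as follows. This is a generic farthest point sampling and nearest-neighbor grouping routine written for illustration; the function names and sizes are assumptions, not taken from the PointConv code.

```python
import numpy as np

def farthest_point_sampling(points, num_samples, start=0):
    """Greedily pick points that maximize distance to the already chosen set."""
    chosen = np.empty(num_samples, dtype=np.int64)
    min_sq_dist = np.full(points.shape[0], np.inf)
    current = start
    for i in range(num_samples):
        chosen[i] = current
        sq_dist = np.sum((points - points[current]) ** 2, axis=1)
        min_sq_dist = np.minimum(min_sq_dist, sq_dist)   # distance to nearest chosen point
        current = int(np.argmax(min_sq_dist))            # farthest remaining point
    return chosen

def knn_group(points, center_idx, k):
    """Return indices of the k nearest points for every sampled center."""
    diff = points[center_idx][:, None, :] - points[None, :, :]   # (M, N, 3)
    sq_dist = np.sum(diff ** 2, axis=-1)                         # (M, N)
    return np.argsort(sq_dist, axis=1)[:, :k]                    # (M, k)

# Toy usage: 256 random points, 32 sampled centers, 8 neighbors each.
rng = np.random.default_rng(0)
points = rng.normal(size=(256, 3))
centers = farthest_point_sampling(points, 32)
groups = knn_group(points, centers, 8)
```

Each row of `groups` defines one neighborhood $\mathcal{N}(i)$ on which a PointConv operator would then be evaluated.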

Quantitative results show 92.5% accuracy on ModelNet40, class-averaged IoU of 82.8% (instance-avg 85.7%) on ShapeNet, and 55.6% mIoU on ScanNet, outperforming prior 3D point cloud networks (Wu et al., 2018). On CIFAR-10 converted to a point cloud, PointConv matches the performance of grid CNNs.

4. Robustness Enhancements: Polynomial Kernels and Viewpoint Invariance

Subsequent work explores the effect of the kernel parameterization and geometric invariance. The MLP-based weight function can be replaced by a fixed third-degree polynomial basis $\phi(\tau)$ with learned linear coefficients, yielding a kernel

$$W(\tau) = \sum_{k=1}^{20} \theta_k \, \phi_k(\tau),$$

where $\theta_k$ are learnable coefficients and $\phi = (\phi_1, \dots, \phi_{20})$ is a 20-dimensional 3D basis normalized such that $\ell_2$ regularization corresponds to a Sobolev-norm penalty, encouraging smoothness (Li et al., 2021). The training loss becomes

$$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda \lVert \theta \rVert_2^2,$$

where $\lambda$ controls the regularization strength.
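
A polynomial kernel of this kind can be sketched with plain monomials. Note the assumptions: the code uses the 20 unnormalized monomials of degree at most three, not the normalized basis of Li et al. (2021), so the squared-coefficient penalty here is an ordinary ridge term rather than an exact Sobolev penalty. The names `theta`, `lam`, and the helper functions are hypothetical.

```python
import numpy as np
from itertools import product

# All exponent triples (a, b, c) with a + b + c <= 3: 20 monomials in 3D.
EXPONENTS = [e for e in product(range(4), repeat=3) if sum(e) <= 3]

def monomial_basis(offsets):
    """Evaluate x^a * y^b * z^c for every exponent triple; offsets is (K, 3)."""
    x, y, z = offsets[:, 0], offsets[:, 1], offsets[:, 2]
    return np.stack([x**a * y**b * z**c for a, b, c in EXPONENTS], axis=1)  # (K, 20)

def polynomial_kernel(offsets, theta):
    """Kernel as a linear combination of basis functions; theta is (20, C_out, C_in)."""
    return np.einsum("kb,boc->koc", monomial_basis(offsets), theta)  # (K, C_out, C_in)

def regularized_loss(task_loss, theta, lam):
    """Task loss plus lam times the squared l2 norm of the coefficients."""
    return task_loss + lam * np.sum(theta ** 2)

# Toy usage with random coefficients and offsets.
rng = np.random.default_rng(0)
theta = rng.normal(size=(len(EXPONENTS), 2, 3))
offsets = 0.1 * rng.normal(size=(10, 3))
kernels = polynomial_kernel(offsets, theta)
```

Because the basis is fixed, only the linear coefficients are trained, which makes the kernel a smooth global function of the offset rather than a free-form MLP.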

A viewpoint-invariant (VI) input descriptor replaces the raw offset $p_i - p_j$ for each neighbor. This 8D vector is computed from the relative position, surface normals, and several geometric projections, making the resulting kernel insensitive to global rotation and robust to scale changes and variable local sampling density.

5. Empirical Evaluation and Robustness Analysis

PointConv and its polynomial variants demonstrate improved robustness to sampling density, scale, and rotation in both 2D and 3D settings. Notable empirical findings (Li et al., 2021):

  • On 2D MNIST under scale changes, $\epsilon$-ball neighborhoods with polynomial+Sobolev weighting achieve roughly 95% accuracy on heavily rescaled objects, significantly outperforming both standard MLP-based PointConv and 2D CNNs.
  • For rotations, polynomial+Sobolev with $\epsilon$-ball neighbors yields 85.5% accuracy versus 37.0% for MLP-PointConv and 74.3% for a classical CNN.
  • On ScanNet, use of the VI descriptor yields substantially higher mIoU under varying test point densities (e.g., 44.7% at 10K points for VI, versus 17.8% for standard raw-coordinate input).
  • For SemanticKITTI, VI-PointConv achieves 59.6% mIoU, outperforming vanilla PointConv (53.0%) and matching or surpassing KPConv under identical conditions.

| Dataset & Task | Method | Key Metric/Value |
|---|---|---|
| ModelNet40 Classification | PointConv | 92.5% accuracy |
| ShapeNet Part Segmentation | PointConv | Class-IoU 82.8% |
| ScanNet Scene Segmentation | PointConv | 55.6% mIoU |
| SemanticKITTI Segmentation | VI-PointConv | 59.6% mIoU |

Robustness improvements arise because polynomial bases are fixed global functions, reducing overfitting to training point patterns; Sobolev regularization suppresses spurious high-frequency filters; and VI descriptors remove sensitivity to rigid transforms.

6. Invariance, Limitations, and Theoretical Properties

Key invariances:

  • Translation invariance: $W$ depends only on relative coordinates $p_i - p_j$.
  • Permutation invariance: Neighborhood summation is order-independent; the MLP or polynomial is shared across all neighborhoods.
  • Rotation/scale robustness: Achieved by substituting raw coordinates with the VI descriptor and fixed polynomial bases.
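
The first two invariances can be verified numerically on a toy operator. The sketch below assumes a small fixed tanh MLP with random weights as the weight function and precomputed densities; it is an illustrative check, not a reproduction of any published experiment.

```python
import numpy as np

rng = np.random.default_rng(1)
C_in, C_out, hidden_dim, K = 3, 2, 6, 10
w1 = rng.normal(size=(3, hidden_dim))              # hypothetical MLP weights
w2 = rng.normal(size=(hidden_dim, C_out * C_in))

def pointconv(center, neighbors, feats, density):
    """Density-corrected sum with a weight function of the offsets only."""
    offsets = center - neighbors                                   # (K, 3)
    kernels = (np.tanh(offsets @ w1) @ w2).reshape(-1, C_out, C_in)
    return np.einsum("koc,kc->o", kernels, feats / density[:, None])

center = rng.normal(size=3)
neighbors = center + 0.1 * rng.normal(size=(K, 3))
feats = rng.normal(size=(K, C_in))
density = rng.uniform(0.5, 2.0, size=K)

base = pointconv(center, neighbors, feats, density)

# Translating the whole neighborhood leaves the offsets unchanged.
shift = np.array([10.0, -3.0, 2.5])
translated = pointconv(center + shift, neighbors + shift, feats, density)

# Reordering the neighbors only reorders the terms of the sum.
perm = rng.permutation(K)
permuted = pointconv(center, neighbors[perm], feats[perm], density[perm])

assert np.allclose(base, translated)
assert np.allclose(base, permuted)
```

A rotation of the neighborhood would change the offsets fed to the MLP, which is why rotation robustness requires the VI descriptor rather than raw coordinates.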

Limitations:

  • Offline KDE density estimation can be computationally expensive, though alternatives (learnable density, simple neighbor counting) can mitigate this.
  • Local MLPs introduce higher per-region compute and memory overhead versus fixed filters, motivating efficient sum reordering.
  • Extreme sparsity or highly anisotropic sampling may challenge KDE and neighborhood construction.
  • For polynomial+VI extensions, robustness to arbitrary affine transforms is not guaranteed; some sensitivity to deformation or non-Euclidean structure can remain (Wu et al., 2018, Li et al., 2021).

7. Summary and Impact

PointConv provides a principled, translation- and permutation-invariant method for deep convolution on point clouds, correcting for sampling irregularities and supporting efficient hierarchical architectures. Its extensions with polynomial weight functions and geometric-invariant inputs deliver improved robustness to sampling density, object pose, and scale. PointConv constitutes a foundation for point-based deep learning networks in 3D scene understanding, segmentation, and classification, and achieves state-of-the-art performance on a range of established datasets (Wu et al., 2018, Li et al., 2021).

