
PointCLIP: 3D Point Cloud Classification

Updated 5 February 2026
  • The paper introduces PointCLIP, which uses multi-view depth map projections to adapt CLIP for 3D point cloud classification in both zero-shot and few-shot settings.
  • It employs a frozen CLIP visual encoder with a lightweight inter-view adapter and ensembling strategies to enhance performance on benchmarks like ModelNet and ScanObjectNN.
  • Subsequent advances, including PointCLIP V2, refine projection techniques and prompt generation, extending the framework to segmentation and detection tasks.

PointCLIP is a framework for leveraging large-scale contrastive vision–language models, specifically the CLIP architecture, to perform zero-shot and few-shot 3D point cloud classification by bridging the modality gap between 3D geometric data and pre-trained image–text representations. Instead of training new 3D-specific neural architectures, PointCLIP projects 3D point clouds into multi-view 2D depth images, processes these with CLIP’s frozen visual encoder, and matches features to CLIP-encoded text prompts describing 3D categories. Performance is further enhanced by a lightweight “inter-view adapter” for few-shot learning and an ensembling strategy with classical 3D networks. Subsequent research, including PointCLIP V2 and variants, builds on this methodology to address domain alignment and extend the framework to segmentation and detection.

1. Background and Motivation

The development of PointCLIP is driven by the recognition that contrastive vision–language pre-training (CLIP), trained on 400M image–text pairs, exhibits robust zero-shot transfer for 2D classification. Extending this capability to 3D domains is nontrivial, due to the lack of natural RGB content and the irregular, unordered nature of point clouds. PointCLIP addresses the challenge by representing 3D point clouds as sets of multi-view 2D depth maps, suitable as input to CLIP’s image encoder, and formulating 3D recognition as a special case of 2D vision–language alignment (Zhang et al., 2021, Huang et al., 2022, Ghose et al., 2024).

CLIP’s open-vocabulary classification is enabled by encoding class names into rich textual embeddings; PointCLIP leverages this for 3D tasks by designing prompts that reflect geometric semantics, such as “point cloud depth map of a [CLASS].” The prospect of aligning 3D data with pre-trained 2D knowledge offers the promise of both strong zero-shot and low-resource few-shot learning capabilities.

2. Methodology and Architecture

Multi-view Depth Map Projection

PointCLIP projects a normalized point cloud $P = \{(x, y, z)\}$ into $M$ fixed camera views (front, right, back, left, top, bottom). Each view orthographically projects the 3D coordinates onto a 2D grid, where pixel values encode depth, eschewing any mesh rendering or photorealistic shading. These sparse single-channel depth maps are resized to $224 \times 224$ pixels for compatibility with CLIP’s image encoder (Zhang et al., 2021). This projection approach is lightweight and preserves the geometric structure crucial for shape classification.
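
The projection step can be sketched in a few lines of NumPy. This is an illustrative reimplementation rather than the authors’ code: the exact view set, the z-buffer rule, and the choice of 0 as the background value are simplifying assumptions.

```python
import numpy as np

def project_depth_map(points, view="front", size=224):
    """Orthographically project a normalized point cloud (N, 3) in [-1, 1]^3
    into a single-channel depth map, keeping the nearest point per pixel."""
    # Axis permutation (u, v, depth) for each of the six orthogonal views.
    views = {
        "front": (0, 1, 2), "back": (0, 1, 2),
        "left": (2, 1, 0), "right": (2, 1, 0),
        "top": (0, 2, 1), "bottom": (0, 2, 1),
    }
    u, v, d = views[view]
    xs, ys, depth = points[:, u], points[:, v], points[:, d]
    if view in ("back", "right", "bottom"):
        depth = -depth  # look from the opposite side
    # Map [-1, 1] coordinates to integer pixel indices.
    px = np.clip(((xs + 1) / 2 * (size - 1)).astype(int), 0, size - 1)
    py = np.clip(((ys + 1) / 2 * (size - 1)).astype(int), 0, size - 1)
    depth_map = np.full((size, size), np.inf)
    # Z-buffer scatter: keep the smallest depth at each pixel.
    np.minimum.at(depth_map, (py, px), depth)
    # Empty pixels become background (0 here is a simplification).
    depth_map[np.isinf(depth_map)] = 0.0
    return depth_map
```

Because nothing is rendered or shaded, producing all $M$ views costs only a handful of index operations per point, which is what makes the projection essentially free compared to the CLIP forward pass.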

CLIP-Based Feature Extraction

  • Visual Encoding: Each depth map is processed through CLIP’s frozen visual encoder (ResNet-50/101 or ViT-B/16), extracting per-view feature vectors $f_i \in \mathbb{R}^{1\times C}$.
  • Textual Encoding: For $K$ category labels, prompt templates such as “point cloud depth map of a [CLASS]” (zero-shot) or “point cloud of a big [CLASS]” (few-shot) are generated, substituting each class name. These are encoded to yield a classifier matrix $W_t \in \mathbb{R}^{K\times C}$.
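
In code, the prompting side reduces to filling a template per class and stacking the resulting text embeddings into $W_t$. The NumPy sketch below treats CLIP’s text encoder as an external black box: the embeddings passed to `text_classifier` (a name chosen here for illustration) are assumed to come from CLIP.

```python
import numpy as np

def build_prompts(class_names, template="point cloud depth map of a {}."):
    """Substitute each class name into the 3D-aware prompt template."""
    return [template.format(name) for name in class_names]

def text_classifier(text_features):
    """Stack K prompt embeddings into the zero-shot classifier W_t (K, C),
    L2-normalized so that dot products become cosine similarities."""
    W = np.asarray(text_features, dtype=float)
    return W / np.linalg.norm(W, axis=1, keepdims=True)
```

Because the class vocabulary only enters through these prompts, swapping in a new label set requires no retraining, which is what makes the pipeline open-vocabulary.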

Prediction and Aggregation

For each view, similarity-based logits are computed as $\text{logits}_i = f_i W_t^T$, and the final point cloud prediction aggregates these logits, weighted by fixed or learnable coefficients $\alpha_i$: $$\text{logits}_p = \sum_{i=1}^M \alpha_i\, \text{logits}_i,$$ with predicted probabilities obtained via softmax.
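
The aggregation step can be written out directly. This NumPy sketch assumes L2-normalized features (as CLIP uses cosine similarity) and equal view weights by default; the real view weights are tuned or learned.

```python
import numpy as np

def predict(view_feats, W_t, alphas=None):
    """Per-view logits f_i W_t^T, weighted-summed across views, then softmax.
    view_feats: (M, C) per-view visual features; W_t: (K, C) text classifier."""
    f = view_feats / np.linalg.norm(view_feats, axis=1, keepdims=True)
    logits = f @ W_t.T                  # (M, K) per-view similarity logits
    M = logits.shape[0]
    if alphas is None:
        alphas = np.full(M, 1.0 / M)    # equal view weights by default
    fused = alphas @ logits             # (K,) aggregated logits
    e = np.exp(fused - fused.max())     # numerically stable softmax
    return e / e.sum()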

Inter-view Adapter for Few-Shot Learning

PointCLIP augments the basic pipeline with an inter-view adapter, a trainable three-layer MLP, only in the few-shot regime. The adapter globally fuses concatenated multi-view features, applies two linear layers (with ReLU activation), and generates residuals for each view, which are added to the original per-view features before final classification. The adapter is lightweight, containing approximately 0.5–1M parameters, and is the only component updated during few-shot fine-tuning; both CLIP encoders are kept frozen (Zhang et al., 2021).
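
A forward-pass-only NumPy sketch of such an adapter is shown below. The layer widths are illustrative rather than the paper’s exact sizes, and biases and the training loop are omitted for brevity.

```python
import numpy as np

class InterViewAdapter:
    """Minimal sketch of an inter-view adapter: concatenated multi-view
    features are fused into a global vector by two linear layers with ReLU,
    then a third layer maps the global vector back to per-view residuals."""
    def __init__(self, num_views, dim, hidden=256, seed=0):
        rng = np.random.default_rng(seed)
        self.M, self.C = num_views, dim
        s = 0.02  # small init so residuals start near zero
        self.w1 = rng.normal(0, s, (num_views * dim, hidden))
        self.w2 = rng.normal(0, s, (hidden, dim))            # global feature
        self.w3 = rng.normal(0, s, (dim, num_views * dim))   # per-view residuals

    def __call__(self, view_feats):                # view_feats: (M, C)
        x = view_feats.reshape(-1)                 # concatenate all views
        g = np.maximum(x @ self.w1, 0) @ self.w2   # globally fused vector
        res = (g @ self.w3).reshape(self.M, self.C)
        return view_feats + res                    # residual connection
```

The residual design means an untrained adapter leaves CLIP’s features nearly untouched, so few-shot fine-tuning starts from the zero-shot solution rather than from scratch.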

Ensemble with Classical 3D Networks

PointCLIP’s predictions can be linearly fused with those from state-of-the-art supervised 3D architectures (e.g., PointNet++, DGCNN, CurveNet), yielding consistent accuracy improvements even when PointCLIP’s own standalone accuracy is comparatively low (Zhang et al., 2021).
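
The fusion itself is a simple convex combination of the two models’ logits; in this sketch the mixing weight `beta` is an illustrative hand-tuned hyperparameter, not a value from the paper.

```python
import numpy as np

def ensemble_logits(clip_logits, net_logits, beta=0.5):
    """Linear fusion of PointCLIP logits with a supervised 3D network's
    logits over the same K classes."""
    return beta * np.asarray(clip_logits) + (1.0 - beta) * np.asarray(net_logits)
```

The intuition is that CLIP-derived logits carry complementary 2D semantic priors that the point-based network, trained only on 3D data, does not capture.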

3. Experimental Settings and Results

Datasets and Evaluation Protocols

PointCLIP evaluation uses widely adopted benchmarks:

  • ModelNet10: 3,991 train / 908 test, 10 synthetic classes, 1,024 points per sample
  • ModelNet40: 9,843 train / 2,468 test, 40 classes
  • ScanObjectNN: 2,321 train / 581 test, 15 real-scan categories

Zero-shot evaluation sets $M=6$ orthogonal views, adapts depth map size per dataset, and tunes or fixes view weights. Few-shot uses $M=10$ views, including diagonals, and allows view weights $\alpha$ to be learnable.

Quantitative Results

| Task (Backbone) | Dataset | PointCLIP Acc. | CLIP2Point Acc. | PointCLIP V2 Acc. | PPCITNet Acc. |
| --- | --- | --- | --- | --- | --- |
| Zero-shot | ModelNet10 | 30.23% | 66.63% | 73.13% | — |
| Zero-shot | ModelNet40 | 20.18% | 49.38% | 64.22% | 22.74% |
| Zero-shot | ScanObjectNN | 15.38% | 35.46% | 50.09% | — |
| Few-shot (16) | ModelNet40 | 87.20% (RN101) | 89.79% (pretr.) | — | 88.93% |
| Few-shot (16) | ModelNet10 | 89.33% | 90.21% | — | 94.30% |
| Few-shot (16) | ScanObjectNN | 54.37% | 57.49% | — | 63.22% |

PointCLIP achieves zero-shot accuracies of 30.23% (ModelNet10), 20.18% (ModelNet40), and 15.38% (ScanObjectNN). Few-shot learning with adapter boosts ModelNet40 accuracy from 50.71% (1-shot) to 87.20% (16-shot). Ablation studies reveal that six views optimize zero-shot accuracy and that omitting global fusion within the adapter reduces 16-shot accuracy by 3.3% (Zhang et al., 2021, Huang et al., 2022, Ghose et al., 2024).

Ensembling with PointNet++ and CurveNet further improves supervised baselines: for PointNet++, the fused model achieves 92.10% (+2.39%) on ModelNet40 at 16-shot (Zhang et al., 2021).

PointCLIP V2, with improved visual projection and GPT-generated prompts, increases zero-shot ModelNet10 accuracy by +42.90% absolute (to 73.13%) and delivers strong performance on segmentation and detection as well (Zhu et al., 2022).

4. Limitations and Subsequent Improvements

Domain Gap and View Variability

PointCLIP’s accuracy is fundamentally limited by a significant domain gap: depth maps lack the color, texture, and “natural” cues that CLIP’s visual encoder expects, causing CLIP features derived from depth maps to be out-of-distribution. Sparse point clouds also induce per-view feature variability due to geometry and rendering artifacts.

Adapter Structure

The original adapter aggregates all views into a single global embedding, then feeds this uniformly back to all views, underutilizing view-specific cues. This can limit discrimination for object classes differentiated by local geometry visible only in particular viewpoints (Ghose et al., 2024).

Data Efficiency and Expressivity

PointCLIP is effective for classification, but does not directly address more structured tasks such as detection or semantic segmentation. Its generalization to noisy real-world scans and complex scenes is unproven.

5. Advances and Extensions

PointCLIP V2

PointCLIP V2 closes the 2D–3D domain gap more aggressively by:

  • “Densifying” and “smoothing” the projected point cloud into depth maps with naturalistic appearance using grid-based operations (min-pooling, Gaussian smoothing, spatial quantization).
  • Prompting GPT-3 to generate hundreds of 3D-specific textual prompts per class, capturing fine-grained shape semantics lacking in generic CLIP prompts.
  • Achieving marked gains: zero-shot classification improves by +42.90%, +40.44%, and +28.75% (ModelNet10/40, ScanObjectNN), and the framework extends to zero-shot part segmentation and detection by propagating CLIP feature tokens and integrating proposal-based 3D architectures (Zhu et al., 2022).
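
The grid-based densify-and-smooth idea can be illustrated in NumPy: local min-pooling spreads each sparse depth value to its neighborhood, and a separable Gaussian blur smooths the result. This is a sketch of the general technique, not PointCLIP V2’s official implementation; window size and sigma are illustrative, and depths are assumed positive with 0 as background.

```python
import numpy as np

def densify_and_smooth(depth_map, pool=7, sigma=1.0):
    """Densify a sparse depth map by local min-pooling, then Gaussian-blur it."""
    H, W = depth_map.shape
    # Treat empty (zero) pixels as infinitely far for the min-pool.
    big = np.where(depth_map > 0, depth_map, np.inf)
    padded = np.pad(big, pool // 2, constant_values=np.inf)
    out = np.full((H, W), np.inf)
    # Min-pool: each pixel takes the nearest depth in its pool x pool window.
    for dy in range(pool):
        for dx in range(pool):
            out = np.minimum(out, padded[dy:dy + H, dx:dx + W])
    out[np.isinf(out)] = 0.0
    # Separable Gaussian blur for a smoother, more "natural" appearance.
    r = int(3 * sigma)
    k = np.exp(-np.arange(-r, r + 1) ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    out = np.apply_along_axis(lambda m: np.convolve(m, k, mode="same"), 0, out)
    out = np.apply_along_axis(lambda m: np.convolve(m, k, mode="same"), 1, out)
    return out
```

The effect is to turn a scatter of isolated depth pixels into contiguous silhouettes, which sit closer to the distribution of images CLIP was trained on.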

Image Translation and Adapter Refinements

Recent efforts, such as the PPCITNet architecture, integrate a dedicated U-Net trained to translate point cloud masks into plausible pseudo-RGB images that incorporate color and saliency cues, bringing the projection outputs closer to CLIP’s image–text training distribution. Furthermore, a viewpoint adapter with both view-specific and global branches is introduced, yielding feature representations that fuse localized and holistic shape information. These strategies deliver +4–6% accuracy over prior CLIP-based approaches and demonstrate robustness to prompt variation and few-shot scarcity (Ghose et al., 2024).

Pre-training, Contrastive Depth Alignment

CLIP2Point enhances the approach by performing cross-modality image-depth contrastive pre-training on large image/depth pairs from ShapeNet, aligning depth features with CLIP’s image space and reducing feature drift. Combined with a dual-path adapter that fuses both CLIP and specialized depth encoders, these innovations lift zero-shot and few-shot accuracy by large margins compared to vanilla PointCLIP (Huang et al., 2022).
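
The contrastive objective underlying such cross-modal pre-training can be sketched as a symmetric InfoNCE loss over aligned depth/image feature pairs. This is a generic formulation of the technique, not CLIP2Point’s exact loss; the temperature value and batch construction are assumptions.

```python
import numpy as np

def info_nce(depth_feats, image_feats, temperature=0.07):
    """Symmetric InfoNCE loss: row i of depth_feats and row i of image_feats
    are a matching pair; all other rows in the batch serve as negatives."""
    d = depth_feats / np.linalg.norm(depth_feats, axis=1, keepdims=True)
    v = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    logits = d @ v.T / temperature          # (N, N) cosine similarities
    n = len(d)
    def ce(lg):
        # Cross-entropy with the matching pair (diagonal) as the target.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()
    return 0.5 * (ce(logits) + ce(logits.T))  # depth-to-image + image-to-depth
```

Minimizing this loss pulls each depth embedding toward its paired image embedding in CLIP’s feature space, which is what reduces the feature drift described above.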

6. Comparative Summary and Perspectives

| Method | Projection | Visual Encoder | Adapter | Text Prompting | Notable Extensions |
| --- | --- | --- | --- | --- | --- |
| PointCLIP | Orthographic depth | Frozen CLIP | 3-layer MLP (global + per-view) | Handcrafted 3D prompt | Ensemble with 3D models |
| PointCLIP V2 | Densify + smooth grid | Frozen CLIP | Lightweight (learned in few-shot) | GPT-3 generated | Segmentation, detection (zero-shot) |
| PPCITNet | U-Net mask→RGB | Frozen CLIP | Dual-branch (view + global) | Handcrafted prompt | Color/saliency cue translation |
| CLIP2Point | Depth w/ dilation | CLIP + depth encoder | Dual-path adapter | Handcrafted prompt | Pre-trained on image–depth pairs |

Later work demonstrates that 2D–3D transfer is highly sensitive to projection realism and prompt design; pre-training for domain alignment is highly beneficial, and adapter neutrality (no bias toward global or local cues) increases robustness in low-data settings. All approaches keep CLIP’s core encoders frozen for efficiency and to prevent catastrophic forgetting.

7. Limitations and Future Directions

Current variants of PointCLIP are constrained by:

  • Sensitivity to the choice of projection parameters, view count, and mask rendering, impacting generalization.
  • Residual domain shift when faced with real-world point cloud distributions (LiDAR, cluttered scenes, thin structures).
  • Lack of explicit reasoning about local geometric relations or end-to-end learned projections.
  • Limited exploration of large-scale scene understanding, open-vocabulary segmentation, and interactive 3D tasks.

Proposed future directions include integrating adaptive, learnable multi-resolution projection modules, jointly fine-tuning CLIP with self-supervised geometric objectives, leveraging GPT-4 or multimodal LLMs for complex visual-textual queries, and employing pre-training with extensive 3D–text corpora to further narrow residual domain gaps (Zhang et al., 2021, Zhu et al., 2022, Huang et al., 2022, Ghose et al., 2024).
