
ScanObjectNN: 3D Object Classification Benchmark

Updated 20 January 2026
  • ScanObjectNN is a large-scale 3D point cloud benchmark that evaluates object classification models under real-world challenges such as occlusion, noise, and clutter.
  • It offers multiple dataset variants with progressive perturbations and a standardized 80/20 train-test split using metrics like overall and mean class accuracy.
  • Advances in architectural innovations, pretraining, and augmentation strategies have led to state-of-the-art results on challenging variants like PB_T50_RS.

ScanObjectNN is a large-scale, real-world 3D point cloud object classification benchmark designed to evaluate the robustness and generalization capacity of 3D deep learning models under realistic capture conditions including clutter, occlusion, partiality, and background noise. Its construction, split protocol, and the evolution of baseline and state-of-the-art methods on its various difficulty settings have made it a standard for assessment of point cloud analysis models, particularly for those aiming for strong real-to-real and synthetic-to-real transfer (Uy et al., 2019).

1. Dataset Construction and Structure

ScanObjectNN was derived from SceneNN (100 highly cluttered indoor scenes) and ScanNet (1,513 reconstructed scenes). The curation process selected, manually inspected, and filtered objects for ambiguity, sparsity, poor reconstruction, and class imbalance, yielding a final tally of 2,902 objects in 15 everyday categories, each represented as a point cloud (bag, bed, bin, box, cabinet, chair, desk, display, door, pillow, shelf, sink, sofa, table, toilet). The dataset provides multiple variants of increasing complexity to simulate different real-world challenges:

  • OBJ_ONLY: Segmented objects with no extra background.
  • OBJ_BG: Objects plus any background points within the ground-truth bounding box.
  • PB_T25, PB_T25_R, PB_T50_R, PB_T50_RS: Progressive perturbations with random translation (up to ±50% of box size), random yaw-rotation, and uniform scaling. PB_T50_RS (“perturbed box, 50% translation, random scale”) is the most widely used and stringent variant for benchmarking.
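The perturbations above can be sketched as follows. This is a minimal illustration of the PB_T50_RS recipe (translation up to ±50% of the box size, random yaw, uniform scale); the exact parameter ranges and ordering in the dataset release may differ, and the scale range here is an assumption.

```python
import numpy as np

def perturb_pb_t50_rs(points, box_size, rng=None):
    """Apply a PB_T50_RS-style perturbation to an (N, 3) point cloud:
    random translation up to +/-50% of the bounding-box size, a random
    yaw rotation, and a uniform random scale (ranges are illustrative)."""
    rng = np.random.default_rng() if rng is None else rng

    # Random translation: up to +/-50% of the box extent per axis.
    t = rng.uniform(-0.5, 0.5, size=3) * box_size

    # Random yaw rotation about the vertical (z) axis.
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])

    # Uniform random scaling (this range is an assumption).
    scale = rng.uniform(0.8, 1.2)

    return scale * (points @ R.T) + t
```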

Each sample is uniformly down-sampled to 1,024 points, centered, and normalized to [-1, 1]^3 across the (x, y, z) axes; all points are provided without normals (Uy et al., 2019).
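A sketch of this preprocessing step, assuming uniform random down-sampling and centroid centering (the released data may use a different sampling scheme):

```python
import numpy as np

def preprocess(points, n_points=1024, rng=None):
    """Down-sample an (N, 3) point cloud to a fixed number of points,
    center it at the origin, and scale it into [-1, 1]^3."""
    rng = np.random.default_rng() if rng is None else rng

    # Uniform random down-sampling (with replacement only if too few points).
    idx = rng.choice(len(points), size=n_points, replace=len(points) < n_points)
    pts = points[idx]

    # Center at the centroid.
    pts = pts - pts.mean(axis=0)

    # Scale so the largest absolute coordinate is 1, i.e. fits in [-1, 1]^3.
    pts = pts / np.abs(pts).max()
    return pts
```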

2. Benchmark Protocol and Evaluation

The canonical protocol consists of an 80/20 (train/test) split, taking care to prevent scene overlap across splits. No validation set is used in the original release; models are trained to convergence and tested on the holdout set. The principal evaluation metrics are:

  • Overall Accuracy (OA): Proportion of correctly classified samples, $\mathrm{OA} = \frac{1}{N}\sum_{i=1}^N \mathbf{1}\{\hat y_i = y_i\}$.
  • Mean Class Accuracy (mAcc): Average per-class accuracy, $\mathrm{mAcc} = \frac{1}{C}\sum_{c=1}^C \frac{1}{N_c}\sum_{i: y_i = c}\mathbf{1}\{\hat y_i = y_i\}$ (Uy et al., 2019).
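The two metrics above translate directly into code; a minimal NumPy implementation:

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """OA: fraction of correctly classified samples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def mean_class_accuracy(y_true, y_pred):
    """mAcc: per-class accuracy, averaged over the classes present
    in the ground truth (each class weighted equally)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accs = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(accs))
```

OA is dominated by frequent classes, while mAcc weights every class equally, which is why both are reported on a class-imbalanced benchmark like ScanObjectNN.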

The hardest perturbation variant (PB_T50_RS) serves as the standard for reporting state-of-the-art performance due to its high occlusion, background clutter, and class imbalance.

3. Baseline Methods and Initial Results

The initial benchmarking included representative point-based and voxel-based models:

  • 3DmFV (Fisher vector encoding + 3D CNN)
  • PointNet, PointNet++ (global MLP, hierarchical set abstraction)
  • DGCNN (EdgeConv local graph MLP)
  • SpiderCNN, PointCNN (learnable geometric filters, X-conv aggregation)

When trained and evaluated in-domain (real-to-real), performance on the hardest split (PB_T50_RS) demonstrated a consistent drop relative to synthetic benchmarks such as ModelNet40:

| Method     | PB_T50_RS OA (%) |
|------------|------------------|
| PointNet   | 68.2             |
| PointNet++ | 77.9             |
| DGCNN      | 78.1             |
| PointCNN   | 78.5             |

Training exclusively on ModelNet40 and testing on ScanObjectNN led to severe drops (e.g., DGCNN: 36.8% OA), highlighting the dataset’s challenging nature and large domain gap.

4. State-of-the-Art Methods and Advances

A series of advances have rapidly improved benchmark accuracy, primarily by leveraging architectural innovations, improved training strategies, pretraining, transfer learning, and more effective regularization:

4.1 MLP and Hybrid Architectures

  • PointMLP: Demonstrated that deep, strictly MLP-based hierarchies (with residual and geometric affine modules) can achieve high accuracy (85.4% on PB_T50_RS), challenging the orthodoxy of locality via convolutions or graphs (Ma et al., 2022).
  • PointNeXt: A systematic overhaul of PointNet++ revealed that training protocol, model scaling, and residual/separable MLPs yield 87.7% OA (PB_T50_RS), outperforming PointMLP. Integrating local-neighborhood features and weight-averaging further boosted robustness to 88.6% (Qian et al., 2022, Sheshappanavar et al., 2022).
  • Point-LN: Introduced non-parametric positional encoding (trigonometric and Gaussian), farthest-point sampling, and a lightweight classifier to achieve 91.7% on PB_T50_RS with only 0.8M parameters (Mohammadi et al., 24 Jan 2025).
  • PointWavelet: Spectral-domain learning with graph wavelets and self-attention achieved 87.9% (OA), confirming the value of frequency-domain geometric modeling under noise (Wen et al., 2023).
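Farthest-point sampling, used by Point-LN and by the hierarchical grouping stages of many point networks, is easy to sketch. A straightforward greedy O(N·k) implementation:

```python
import numpy as np

def farthest_point_sampling(points, k, seed_idx=0):
    """Greedy farthest-point sampling over an (N, 3) array: repeatedly
    pick the point farthest from the set selected so far. Returns k indices."""
    selected = np.empty(k, dtype=int)
    selected[0] = seed_idx
    # Distance of every point to its nearest already-selected point.
    dists = np.linalg.norm(points - points[seed_idx], axis=1)
    for i in range(1, k):
        selected[i] = int(np.argmax(dists))
        dists = np.minimum(
            dists, np.linalg.norm(points - points[selected[i]], axis=1))
    return selected
```

Compared with uniform random sampling, this keeps the sampled subset spatially well spread, which preserves thin structures under heavy down-sampling.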

4.2 Convolutional and Kernel-Based Networks

  • KPConvX/KPNeXt: Modern kernel point convolutions with depthwise grouping and kernel attention, deeper inverted bottleneck blocks, and contemporary scaling/training protocols led KPNeXt-L to OA 89.3%/mAcc 88.1% on PB_T50_RS (Thomas et al., 2024).
  • Dense-Resolution Networks: Fusion of adaptive, density-aware grouping, and error-minimizing modules outperformed earlier convolutional designs, achieving 80.3% OA (Qiu et al., 2020).

4.3 Pretraining and Parameter-Efficient Transfer

  • MoST (Monarch Sparse Tuning): Parameter-efficient transfer via structured, sparse updates within pretrained transformers (e.g., PointGPT). Achieved 97.5% OA on PB_T50_RS with only 2.2% of parameters tuned (b=8), exceeding full fine-tuning and all prior PEFT approaches (Han et al., 24 Mar 2025).
  • Asymmetric Dual Self-Distillation: A hybrid of latent-space masked modeling, multi-mask/crop strategies, and invariance-based self-distillation, setting a new PB_T50_RS OA of 90.53% (improving to 93.72% with large-scale mixture pretraining) (Leijenaar et al., 26 Jun 2025).

4.4 2D Knowledge Transfer and Alternative Paradigms

  • P2P (Point-to-Pixel Prompting): Projecting 3D point clouds to geometry-preserved images and prompting pretrained 2D backbones (ImageNet-pretrained ViT, ResNet, HorNet) yielded 89.3% OA on PB_T50_RS using only a small set of trainable parameters (Wang et al., 2022).
  • DiffCLIP / TeGA for Zero-Shot: Bridging domain gaps using diffusion-based 3D→2D style transfer and text-conditioned augmentation; DiffCLIP reached 43.2% zero-shot OA on OBJ_BG (Shen et al., 2023), while TeGA showed +4.6 pp zero-shot gain by filtering synthetic 3D data (Torimi et al., 16 Jan 2025).

5. Data Augmentation, Regularization, and Training Protocols

Robust performance on ScanObjectNN is strongly associated with sophisticated augmentation and regularization techniques:

  • Patch-level mixing (PointPatchMix): Generates realistic local hybrids by interpolating content-based targets informed by teacher attention maps and patch significance, directly benefiting performance under heavy occlusion (e.g., OBJ_ONLY, +2.8% accuracy over baseline) (Wang et al., 2023).
  • Versatile augmentation: Resampling, scaling, translation, color dropping, label smoothing, and sophisticated learning rate schedules have been systematically studied, contributing up to +8.2% OA over “naive” training in PointNeXt (Qian et al., 2022).
  • Background-aware modules: Explicit segmentation heads and multi-task loss formulations in BGA-PN++ and BGA-DGCNN improved discriminative capacity under clutter (e.g., PointNet++: 77.9% → 80.2%) (Uy et al., 2019).
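A minimal sketch of a train-time augmentation stack of the kind described above (resampling, scaling, translation) together with label smoothing. The ranges are common choices for point cloud training, not the exact PointNeXt recipe:

```python
import numpy as np

def augment(points, rng):
    """Typical point cloud augmentations: random resampling with
    replacement, anisotropic scaling, and a small random translation."""
    idx = rng.choice(len(points), size=len(points), replace=True)
    pts = points[idx]
    pts = pts * rng.uniform(0.9, 1.1, size=3)   # per-axis scale
    pts = pts + rng.uniform(-0.1, 0.1, size=3)  # global shift
    return pts

def smooth_labels(labels, n_classes, eps=0.2):
    """Label smoothing: mix the one-hot target with a uniform distribution."""
    one_hot = np.eye(n_classes)[labels]
    return one_hot * (1.0 - eps) + eps / n_classes
```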

6. Open Challenges, Misconceptions, and Future Research

6.1 Domain Adaptation and Sim2Real

Models trained on synthetic datasets (e.g., ModelNet40) yield <50% accuracy when transferred to ScanObjectNN. Closing this domain gap through adaptation or joint synthetic-real representation learning remains an open challenge (Uy et al., 2019).

6.2 Partiality and Occlusion Robustness

Partial, occluded, and noisy views are typical in real scans but rare in synthetic data. Models such as PointPatchMix and hierarchical multi-crop distillation methods address these issues, but general partiality-robust representations are an ongoing research area (Wang et al., 2023, Leijenaar et al., 26 Jun 2025).

6.3 Scalability and Efficient Deployment

Recent progress includes models that achieve state-of-the-art performance with sub-million parameter counts (e.g., Point-LN), parameter-efficient fine-tuning with zero inference overhead (MoST), and real-time operation for resource-constrained scenarios (Han et al., 24 Mar 2025, Mohammadi et al., 24 Jan 2025).

6.4 Joint Segmentation–Classification and Multimodal Integration

Foreground/background segmentation-aware modules and multimodal contrastive approaches (language/image/3D) are gaining attention, with the goal of enabling more interpretable, generalizable, and multi-task 3D scene understanding (Uy et al., 2019, Torimi et al., 16 Jan 2025).


In summary, ScanObjectNN is now an established, challenging benchmark for 3D point cloud object classification in the presence of noise, occlusion, and background clutter. Progress has been marked by advances in architecture, pretraining, augmentation, and parameter-efficient transfer. Despite saturation in some synthetic benchmarks, ScanObjectNN continues to drive the development of robust, scalable, and generalizable 3D recognition methods (Uy et al., 2019).
