ADBENCH-3D: 3D Anomaly Detection Benchmark
- ADBENCH-3D is a benchmark for comprehensive 3D anomaly detection and segmentation that integrates multiview and multimodal data for single-instance training.
- It leverages high-resolution scans, CAD-based synthetic data, and dense meshes to evaluate performance using metrics like I-AUROC and V-AUPRO.
- The framework exposes open challenges such as domain shift, occlusion in multiview fusion, and scalable voxel-based anomaly localization.
ADBENCH-3D (SiM3D) defines a rigorous benchmark for comprehensive 3D anomaly detection and segmentation (ADS), focused on the integration of multiview and multimodal information in the context of single-instance training. The framework encompasses voxel-based anomaly localization and global detection, quantifies generalization from synthetic to real data, and establishes reference baselines via state-of-the-art single-view methods extended to the multiview 3D regime. ADBENCH-3D advances the state of evaluation in industrial-grade scenarios where only a single nominal object—real or synthetic—is available for learning, and captures open research challenges in domain shift and volumetric reasoning (Costanzino et al., 26 Jun 2025).
1. Dataset Composition and Acquisition
ADBENCH-3D comprises 333 object instances drawn from eight categories: Plastic Stool, Rubbish Bin, Wicker Vase, Bathroom Furniture, Container, Plastic Vase, Wooden Stool, and Sink Cabinet. Per-category instance counts vary significantly, from 15 objects for Wooden Stool to 99 for Plastic Vase. Each category is accompanied by a single CAD prototype. Data are acquired with ZEISS Atos Q stereo cameras, which yield grayscale images at 4096×3000 px (12 Mpx) and point clouds of up to approximately 12 million points per view; the views are integrated into meshes of 5–7 million vertices.
Instances are captured in 12–36 rigidly-calibrated views distributed over a 360° hemisphere, with known camera intrinsics and per-view extrinsics. In addition to real scans, each CAD model generates a matched synthetic split: grayscale renderings and depth maps at identical camera parameters. Synthetic point clouds are re-projected from these depth maps, providing a CAD-based control for domain generalization studies.
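The re-projection of depth maps into point clouds can be illustrated with a standard pinhole back-projection. The following is a minimal NumPy sketch, not the benchmark's own code: the function name `backproject_depth`, the extrinsics convention (camera point = R · world point + t), and the use of 0 as an invalid-depth marker are all assumptions made for illustration.

```python
import numpy as np

def backproject_depth(depth, K, R, t):
    """Lift a depth map to a world-frame point cloud (N x 3).

    Assumed convention (not stated in the source): x_cam = R @ x_world + t,
    depth holds the camera-frame z-coordinate, and 0 marks invalid pixels.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel column / row indices
    valid = depth > 0
    # Homogeneous pixel coordinates of the valid pixels, shape 3 x N
    pix = np.stack([u[valid], v[valid], np.ones(valid.sum())], axis=0)
    rays = np.linalg.inv(K) @ pix                    # camera rays with z = 1
    pts_cam = rays * depth[valid]                    # scale each ray by its depth
    pts_world = R.T @ (pts_cam - t[:, None])         # invert the rigid transform
    return pts_world.T
```

A pixel at the principal point with depth d maps to the camera-frame point (0, 0, d), which is a quick sanity check for the chosen intrinsics.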
2. Task Formulation and Representation
The core problem is single-instance, multiview, multimodal 3D anomaly detection and segmentation (ADS). For each class $c$, learning is restricted to a single nominal instance (real or synthetic). At test time, a novel object yields images $\{I_i\}$, point clouds $\{\mathcal{P}_i\}$ or depth maps $\{D_i\}$, and an integrated mesh $\mathcal{M}$. The objective is twofold:
- Produce a global anomaly score $s$ for instance-level detection;
- Compute a voxel-based Anomaly Volume $\hat{V}$, with each voxel representing the predicted probability of an anomaly at that 3D location.
Meshes are discretized into fixed-size voxel grids (voxel size 2 mm). Ground-truth anomaly volumes $V$ are derived by voxel–triangle intersection with manually annotated meshes. A single-view 2D anomaly map $A_i$ is lifted into 3D by back-projecting its pixels with the known depth and camera parameters, followed by discretization into the voxel grid, which gives a per-view volume $\hat{V}_i$. Multiview aggregation employs voxel-wise max pooling, $\hat{V}(x) = \max_i \hat{V}_i(x)$, establishing a consistent volumetric anomaly prediction.
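The lifting and fusion steps can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the reference implementation: it assumes each view's anomaly scores have already been attached to back-projected 3D points, and the names `scores_to_volume`, `fuse_views`, `origin`, and `grid_shape` are hypothetical.

```python
import numpy as np

def scores_to_volume(points, scores, origin, voxel_size, grid_shape):
    """Scatter per-point anomaly scores into a voxel grid, keeping the max per voxel."""
    volume = np.zeros(grid_shape, dtype=float)
    # Integer voxel index of every 3D point
    idx = np.floor((np.asarray(points, dtype=float) - origin) / voxel_size).astype(int)
    # Discard points that fall outside the grid
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx = idx[inside]
    vals = np.asarray(scores, dtype=float)[inside]
    # Unbuffered maximum, so several points in one voxel are handled correctly
    np.maximum.at(volume, (idx[:, 0], idx[:, 1], idx[:, 2]), vals)
    return volume

def fuse_views(per_view_volumes):
    """Voxel-wise max pooling across the per-view volumes."""
    return np.max(np.stack(per_view_volumes, axis=0), axis=0)
```

Max pooling keeps a defect that is visible in only one view, at the cost of also keeping any single-view false positive.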
3. Evaluation Metrics
ADBENCH-3D defines new and adapted metrics for detection and volumetric segmentation:
- Instance-level AUROC (I-AUROC): area under the ROC curve computed from the global score $s$ over the test set, measuring how well nominal and anomalous samples are separated.
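- Per-Region Overlap (PRO) and V-AUPRO@$\tau$: Extending 2D PRO to 3D, for threshold $t$ and $K$ distinct anomaly regions:
$$\mathrm{PRO}(t) = \frac{1}{K} \sum_{k=1}^{K} \frac{|P_t \cap G_k|}{|G_k|}$$
where $P_t$ is the set of voxels whose predicted anomaly score is at least $t$, and $G_k$ is the $k$-th connected ground truth region. V-AUPRO@$\tau$ integrates the PRO–FPR curve up to a false positive rate of $\tau$, normalized to $[0, 1]$:
$$\text{V-AUPRO@}\tau = \frac{1}{\tau} \int_{0}^{\tau} \mathrm{PRO}(f)\, \mathrm{d}f$$
with PRO expressed as a function of the false positive rate $f$.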
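- Volumetric IoU, Precision, Recall: For binary predictions $P$ and binary ground truth $G$:
$$\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|}, \qquad \mathrm{Precision} = \frac{|P \cap G|}{|P|}, \qquad \mathrm{Recall} = \frac{|P \cap G|}{|G|}$$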
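The per-region overlap and its normalized area under the PRO–FPR curve can be sketched in NumPy as below. This is a simplified illustration rather than the benchmark's evaluation code: it assumes the ground truth is given as an integer volume with one id per defect region, sweeps a fixed number of thresholds, and uses a conservative step-wise curve (best PRO reachable within each FPR budget) instead of linear interpolation. All function and parameter names are hypothetical.

```python
import numpy as np

def pro_at_threshold(pred, gt_labels, t):
    """Mean per-region overlap at threshold t.

    gt_labels: integer volume, 0 = nominal, k > 0 = id of the k-th defect region.
    Expects at least one defect region.
    """
    binary = pred >= t
    region_ids = np.unique(gt_labels[gt_labels > 0])
    overlaps = [(binary & (gt_labels == k)).sum() / (gt_labels == k).sum()
                for k in region_ids]
    return float(np.mean(overlaps))

def fpr_at_threshold(pred, gt_labels, t):
    """Fraction of nominal voxels flagged as anomalous at threshold t."""
    nominal = gt_labels == 0
    return float(((pred >= t) & nominal).sum() / nominal.sum())

def v_aupro(pred, gt_labels, fpr_limit=0.01, num_thresholds=100):
    """Area under the PRO-vs-FPR curve up to fpr_limit, normalized to [0, 1]."""
    thresholds = np.linspace(pred.max(), pred.min(), num_thresholds)
    fprs = np.array([fpr_at_threshold(pred, gt_labels, t) for t in thresholds])
    pros = np.array([pro_at_threshold(pred, gt_labels, t) for t in thresholds])
    # Step curve: for each FPR budget f, the best PRO achieved with FPR <= f
    grid = np.linspace(0.0, fpr_limit, 101)
    best = np.array([pros[fprs <= f].max() if np.any(fprs <= f) else 0.0
                     for f in grid])
    # Trapezoidal integration over the FPR grid, then normalization
    area = np.sum(np.diff(grid) * (best[1:] + best[:-1]) / 2.0)
    return float(area / fpr_limit)
```

Because every region contributes equally to PRO regardless of its size, a small defect that is missed lowers the score as much as a large one.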
Metrics are always computed on the fused anomaly volume, enabling direct comparison between single-modality and multiview variants, and facilitating real-to-real vs. synthetic-to-real (synth2real) comparisons that quantify domain shift.
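For the instance-level and binary volumetric metrics, a compact NumPy sketch is given below. It is illustrative only: the AUROC is computed through its pairwise-ranking interpretation (ties count one half), which is a standard equivalent of the area under the ROC curve, and the function names and the zero-denominator fallbacks are assumptions.

```python
import numpy as np

def instance_auroc(scores, labels):
    """AUROC as the probability that an anomalous sample outscores a nominal one."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)      # True = anomalous instance
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]           # all anomalous/nominal pairs
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def volumetric_scores(pred_binary, gt_binary):
    """IoU, precision and recall between two binary voxel volumes."""
    p = np.asarray(pred_binary, dtype=bool)
    g = np.asarray(gt_binary, dtype=bool)
    inter = (p & g).sum()
    union = (p | g).sum()
    iou = inter / union if union else 1.0        # both empty: treated as perfect
    precision = inter / p.sum() if p.sum() else 0.0
    recall = inter / g.sum() if g.sum() else 0.0
    return float(iou), float(precision), float(recall)
```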
4. Benchmark Protocol and Ground Truth Construction
The protocol stipulates, for each object type:
- real2real: train on one real nominal instance (all views)
- synth2real: train on rendered views from the CAD model
The test set comprises the remaining 20–100 instances per category, balanced between nominal and anomalous samples with identical splits across training regimes. Ground-truth annotation involves four stages:
- Manual 2D per-view masks (unique ID per defect);
- Project and lift these onto the integrated mesh using camera parameters;
- Manual 3D refinement in CloudCompare;
- Voxelization at 2 mm resolution to create the ground-truth anomaly volume.
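The final voxelization stage can be approximated with a sampling-based sketch. Note that this is a swapped-in simplification: instead of an exact voxel-triangle intersection test, it samples a barycentric lattice on every annotated triangle and marks the voxels that the samples fall into, so very large triangles would need a higher `samples_per_edge`. The function name, the per-face label array, and all parameters are hypothetical.

```python
import numpy as np

def voxelize_labeled_faces(vertices, faces, face_labels, origin,
                           voxel_size=2.0, grid_shape=(64, 64, 64),
                           samples_per_edge=8):
    """Approximate labeled voxelization of defect triangles (0 = nominal face)."""
    grid = np.zeros(grid_shape, dtype=np.int32)
    n = samples_per_edge
    # Barycentric weights on a regular lattice covering the whole triangle
    bary = np.array([(i / n, j / n, 1.0 - i / n - j / n)
                     for i in range(n + 1) for j in range(n + 1 - i)])
    for tri, label in zip(faces, face_labels):
        if label == 0:
            continue                               # skip nominal faces
        pts = bary @ vertices[tri]                 # sample points on the triangle
        idx = np.floor((pts - origin) / voxel_size).astype(int)
        inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
        idx = idx[inside]
        grid[idx[:, 0], idx[:, 1], idx[:, 2]] = label   # keep the defect id
    return grid
```

Storing the defect id rather than a binary flag keeps the regions separable, which per-region metrics rely on.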
Significant generalization challenges are present: single-instance learning (no inter-nominal variability), domain transfer (synthetic → real object differences in both appearance and geometry), and multiview fusion (occlusion, consistency among views).
5. Baseline Methods and Experimental Results
Five state-of-the-art single-view ADS methods are extended to operate within this benchmark’s multiview, 3D regime:
- Unimodal (Image or Depth):
  - PatchCore (WideResNet-101, DINO-v2 backbones; memory-bank k-NN)
  - EfficientAD (teacher–student with WideResNet-101)
- Multimodal (Image + 3D):
  - BTF: 2D (WideResNet-101) and 3D (FPFH) features concatenated in a memory-bank
  - M3DM: transformer-based feature extraction per modality, fused via OC-SVM (with FPFH for 3D)
  - CFM: cross-modal mapping (MLP regressors: 2D→3D, 3D→2D, using DINO-v2 + FPFH)
Inputs are downsampled before processing; segmentation is performed by max-pooled aggregation of 2D anomaly maps into the 3D anomaly volume.
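The memory-bank k-NN principle behind the PatchCore-style baselines can be illustrated with a short NumPy sketch. It shows only the core idea, namely scoring a test feature by its distance to the nearest stored nominal features; backbone feature extraction, coreset subsampling, and any score re-weighting are omitted, and the function name is hypothetical.

```python
import numpy as np

def knn_anomaly_scores(memory_bank, test_features, k=1):
    """Mean Euclidean distance from each test feature to its k nearest bank entries.

    memory_bank:   (N_bank, D) features taken from the single nominal instance
    test_features: (N_test, D) features taken from the object under test
    """
    bank = np.asarray(memory_bank, dtype=float)
    feats = np.asarray(test_features, dtype=float)
    # Pairwise squared distances, shape (N_test, N_bank); fine for small banks only
    d2 = ((feats[:, None, :] - bank[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.sort(d2, axis=1)[:, :k]          # k smallest per test feature
    return np.sqrt(nearest).mean(axis=1)
```

The dense distance matrix is quadratic in memory, so real feature banks would need chunking or an approximate nearest-neighbor index.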
Mean results over all classes:
| Training | Best Detection I-AUROC | Best Segmentation V-AUPRO@1% |
|---|---|---|
| real2real | ≃ 0.754 (PatchCore + WRN-101) | ≃ 0.671 (PatchCore + DINO-v2) |
| synth2real | ≃ 0.540 (PatchCore + DINO-v2) | ≃ 0.600 (PatchCore + WRN-101) |
Principal observations include:
- Image-only methods (PatchCore, EfficientAD) outperform direct multimodal extensions, especially on large point clouds (∼2M points after downsampling).
- Supplementing multimodal baselines with down-projected depth and DINO-v2 features provides partial improvement, but does not surpass the image-only baselines.
- The synthetic-to-real performance gap is substantial: I-AUROC drops by roughly 20–25 points relative to real2real training.
- Memory-bank (non-parametric) methods excel in single-instance settings compared to full-training paradigms.
These trends indicate the difficulty of robust domain adaptation and multiview fusion in high-fidelity 3D data.
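As a worked check of the gap between training regimes, the best mean results from the table can be compared directly. The snippet only restates the table's values; note that the best real2real and synth2real entries use different backbones, so this is a best-versus-best comparison, not a like-for-like one.

```python
# Best mean results per training regime, copied from the table above
best = {
    "real2real":  {"I-AUROC": 0.754, "V-AUPRO@1%": 0.671},
    "synth2real": {"I-AUROC": 0.540, "V-AUPRO@1%": 0.600},
}

# Gap in percentage points (real2real minus synth2real), rounded to one decimal
gap_points = {
    metric: round(100 * (best["real2real"][metric] - best["synth2real"][metric]), 1)
    for metric in best["real2real"]
}
print(gap_points)
```

On these numbers the detection gap is about 21.4 points and the segmentation gap about 7.1 points, so instance-level detection degrades far more than voxel-level localization when training on synthetic data.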
6. Impact, Open Challenges, and Research Directions
ADBENCH-3D exposes several critical challenges in 3D ADS:
- Achieving robust anomaly detection with single-instance training (no inter-nominal diversity) remains difficult, favoring non-parametric approaches.
- The synthetic-to-real domain gap, driven by both appearance and geometric disparities, is unresolved by current methods.
- Existing 3D backbones underperform on multi-million-point clouds, highlighting a need for scalable, expressive volumetric or geometric architectures.
- End-to-end multiview fusion, able to effectively aggregate correlated evidence and resolve view/occlusion ambiguities, is mostly unsolved.
- Stronger synth2real strategies, potentially leveraging self-supervision or domain adaptation, are required.
ADBENCH-3D establishes a comprehensive, high-resolution platform to advance research in rigorous, real-world 3D anomaly detection and segmentation under stringent training constraints (Costanzino et al., 26 Jun 2025).