3D-Only Datasets Overview

Updated 12 April 2026

3D-only datasets are collections where all annotations are in 3D space, including point clouds, meshes, and depth maps, enabling focused 3D vision research.
They are curated using synthetic generation, real multi-view captures, and crowdsourced pipelines to ensure dense, high-quality, and task-aligned data.
Key applications include 3D reconstruction, recognition, segmentation, and telepresence, with performance assessed by metrics like mIoU, mAP, and RMSE.

A 3D-only dataset is a collection in which all annotated data are tied to three-dimensional representations, such as meshes, point clouds, depth maps, volumetric video, or 3D geometric annotations. Unlike multimodal datasets, where 3D annotations may accompany 2D imagery, 3D-only datasets are curated so that all meaningful annotations, ground truths, and tasks are specified in 3D space. These datasets are critical to research in 3D reconstruction, recognition, segmentation, generative modeling, spatial video, robotic interaction, and spatial reasoning, as well as to the training and benchmarking of deep learning models that operate directly on 3D data.

1. Dataset Typology and Modalities

3D-only datasets span a wide array of modalities, each tailored to specific research tasks. Common forms include:

Point clouds: Collections of $\mathbb{R}^3$ vertex locations, optionally with color or intensity. Point clouds are the foundation for geometric learning, segmentation, and upsampling tasks (e.g., WHU-Synthetic (Zhou et al., 2024), UniG3D (Sun et al., 2023)).
Meshes: Polygonal surface representations with vertex connectivity, suitable for shape analysis and mesh-based learning (e.g., UniG3D (Sun et al., 2023)).
Volumetric video and spatial video: Temporally sequenced 3D data, either as stereoscopic video (SVD (Izadimehr et al., 6 Jun 2025)) or as reconstructed 3D volumes.
Depth maps: Per-pixel float or integer maps representing distance from the camera plane, instrumental in depth completion and multi-view synthesis (VCVW-3D (Ding et al., 2023), WHU-Synthetic (Zhou et al., 2024)).
3D bounding boxes: Axis-aligned or rotated parameterizations $(x, y, z, w, h, l, \theta)$, ubiquitous in detection datasets (VCVW-3D (Ding et al., 2023)).

Auxiliary metadata may include 3D part labels, instance segmentation, camera/extrinsic parameters, object category, textual description, and dynamic annotations (e.g., motion, temporal loop labels).
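
As an illustration of the box parameterization $(x, y, z, w, h, l, \theta)$ listed above, the following minimal sketch expands one box into its eight corner points. The axis convention is an assumption made for this example (center at $(x, y, z)$, extents $l$, $w$, $h$ along the local x, y, z axes, and $\theta$ a yaw angle about the z axis); individual datasets may define these differently, and `box_corners` is a hypothetical helper name.

```python
import numpy as np

def box_corners(x, y, z, w, h, l, theta):
    """Return the 8 corners (shape 8x3) of a yaw-rotated 3D box.

    Assumed convention (varies between datasets): (x, y, z) is the box
    center, l/w/h are extents along the local x/y/z axes, and theta is
    a rotation about the z (up) axis in radians.
    """
    # Corner offsets of a unit cube centered at the origin.
    signs = np.array([[sx, sy, sz]
                      for sx in (-0.5, 0.5)
                      for sy in (-0.5, 0.5)
                      for sz in (-0.5, 0.5)])
    local = signs * np.array([l, w, h])
    c, s = np.cos(theta), np.sin(theta)
    rot_z = np.array([[c, -s, 0.0],
                      [s, c, 0.0],
                      [0.0, 0.0, 1.0]])
    # Rotate about z, then translate to the box center.
    return local @ rot_z.T + np.array([x, y, z])

corners = box_corners(1.0, 2.0, 0.5, w=2.0, h=1.0, l=4.0, theta=np.pi / 2)
extent = corners.max(axis=0) - corners.min(axis=0)
```

With a 90° yaw, the length and width swap between the world x and y axes, so `extent` comes out close to `[2, 4, 1]` for this box.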

2. Generation and Annotation Pipelines

The pipeline for constructing a 3D-only dataset depends on the nature of source data (synthetic, real, or crowdsourced) and the intended task.

Synthetic Geometry Assembly

Datasets such as Primitive3D (Li et al., 2022) employ algorithmic object generation. Objects are assembled programmatically from parameterized primitives (e.g., box, sphere, cylinder, cone, torus):

Each primitive type $\Psi_i$ is parameterized by geometric attributes $\theta \in \Theta^{\Psi_i}$.
Instances are generated as $\psi'_\theta = \lambda \cdot (\Phi \psi_\theta) + \delta$, where $\Phi \in \mathrm{SO}(3)$ is a random rotation, $\lambda \in \mathbb{R}^+$ is a scale factor, and $\delta \in \mathbb{R}^3$ is a translation.
Compositional modeling: Primitives are combined in hierarchical binary trees whose internal nodes are Boolean operations (e.g., regularized union $\cup^*$). Trees are sampled uniformly to maximize diversity, and the validity of the resulting shapes is supported by closure and approximability theorems.
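
The instantiation rule $\psi'_\theta = \lambda \cdot (\Phi \psi_\theta) + \delta$ can be sketched on a sampled point set as follows. This is a minimal illustration, not the Primitive3D implementation: the sphere primitive, the sampling ranges, and the helper names (`random_rotation`, `sample_sphere`, `instantiate`) are all assumptions introduced here.

```python
import numpy as np

def random_rotation(rng):
    """Sample a rotation matrix in SO(3) via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))   # normalize column signs
    if np.linalg.det(q) < 0:      # flip one axis to get det = +1
        q[:, 0] = -q[:, 0]
    return q

def sample_sphere(radius, n, rng):
    """Sample n points on a sphere surface (a stand-in primitive)."""
    v = rng.standard_normal((n, 3))
    return radius * v / np.linalg.norm(v, axis=1, keepdims=True)

def instantiate(points, rng, scale_range=(0.5, 2.0), shift=1.0):
    """Apply psi' = lambda * (Phi psi) + delta to an (n, 3) point set."""
    phi = random_rotation(rng)                  # Phi in SO(3)
    lam = rng.uniform(*scale_range)             # lambda > 0
    delta = rng.uniform(-shift, shift, size=3)  # delta in R^3
    return lam * (points @ phi.T) + delta, (phi, lam, delta)

rng = np.random.default_rng(0)
canonical = sample_sphere(radius=1.0, n=256, rng=rng)
placed, (phi, lam, delta) = instantiate(canonical, rng)
```

Because the canonical primitive here is a unit sphere, every transformed point lies at distance `lam` from `delta`, which gives a quick sanity check on the transform.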

Real and Crowdsourced Multi-View Capture

Real-object datasets such as uCO3D (Liu et al., 13 Jan 2025) and spatial video datasets like SVD (Izadimehr et al., 6 Jun 2025) utilize coordinated capture:

uCO3D: Crowdsourced 360° multi-view scans are collected according to prescribed trajectories, with strict quality control for viewpoint coverage and resolution. Structure-from-motion (VGGSfM) yields dense and sparse point clouds, while vision-language models assign captions and semantic masks.
SVD: Utilizes calibrated consumer-grade devices (e.g., iPhone Pro, Apple Vision Pro) for synchronized stereo video capture, recording extrinsics/intrinsics and rendering per-frame disparity and depth.
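
For a rectified stereo pair, disparity and depth are related by the standard pinhole relation $Z = f B / d$, with focal length $f$ in pixels, baseline $B$, and disparity $d$ in pixels. The sketch below applies that relation; the focal length and baseline values are illustrative assumptions, not the calibration of any device used in SVD, and `disparity_to_depth` is a hypothetical helper.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m, min_disp=1e-6):
    """Convert a disparity map (pixels) to depth (meters): Z = f * B / d.

    Pixels with disparity at or below min_disp are treated as invalid
    and assigned infinite depth.
    """
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full(disparity_px.shape, np.inf)
    valid = disparity_px > min_disp
    depth[valid] = focal_px * baseline_m / disparity_px[valid]
    return depth

# Illustrative values only: 1000 px focal length, 64 mm baseline.
disp = np.array([[64.0, 32.0],
                 [0.0, 16.0]])
depth = disparity_to_depth(disp, focal_px=1000.0, baseline_m=0.064)
```

Halving the disparity doubles the depth, so the three valid pixels here map to roughly 1 m, 2 m, and 4 m, while the zero-disparity pixel is marked as infinitely far.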

Synthetic Environment Simulation

Synthetic environments (e.g., WHU-Synthetic (Zhou et al., 2024), VCVW-3D (Ding et al., 2023)) leverage photorealistic simulators (CARLA, Unity) for data creation. Agents and lighting conditions are randomized; point clouds, depth, segmentation masks, and camera metadata are rendered and stored per sample.
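
When a sample stores a depth map together with camera intrinsics, a point cloud in the camera frame can be recovered by pinhole back-projection. The sketch below assumes the depth map holds metric distance along the optical axis and that the intrinsics are given as `fx`, `fy`, `cx`, `cy`; simulators may encode depth differently, so this is an illustration rather than the export format of any dataset named above.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (N, 3) point cloud.

    Assumes a pinhole camera and z-depth (distance along the optical
    axis). Non-positive or non-finite depths are skipped.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    valid = np.isfinite(depth) & (depth > 0)
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

depth_map = np.full((4, 6), 2.0)
depth_map[0, 0] = 0.0  # mark one pixel as missing
points = backproject_depth(depth_map, fx=100.0, fy=100.0, cx=2.5, cy=1.5)
```

Projecting `points` back through the same intrinsics should reproduce the original integer pixel coordinates, which is a convenient round-trip check when validating exported camera metadata.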

3. Data Quality Controls and Annotation Strategies

Annotation in 3D-only datasets prioritizes precision and density to enable robust downstream training.

Automated semantic and instance labeling: Primitive3D (Li et al., 2022) provides per-point semantic labels ($y_t$) and unique instance labels for each primitive, supporting fine-grained segmentation benchmarking.
Multimodal ground truth: uCO3D (Liu et al., 13 Jan 2025) attaches scene captions (LLAMA3 summaries of per-frame BLIP descriptions), dense/sparse point clouds, and 3D Gaussian Splat (3DGS) models, ensuring each object-scene instance is comprehensively described.
Data curation and filtering: UniG3D (Sun et al., 2023) removes models with “flat” or spurious geometry via analysis of point cloud singular values, and ensures image-text coherence by CLIP-based filtering.
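
One plausible way to detect "flat" geometry from singular values is to compare the smallest and largest singular values of the centered point cloud. The sketch below follows that reading; the ratio criterion and the `min_ratio` threshold are assumptions for illustration and are not taken from UniG3D.

```python
import numpy as np

def flatness_ratio(points):
    """Ratio of smallest to largest singular value of a centered cloud.

    Values near 0 indicate that the points lie close to a plane or line.
    """
    centered = points - points.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)  # sorted descending
    if s[0] == 0.0:  # all points coincide
        return 0.0
    return float(s[-1] / s[0])

def is_degenerate(points, min_ratio=0.01):
    """Flag a cloud as flat; min_ratio is a hypothetical threshold."""
    return flatness_ratio(points) < min_ratio

rng = np.random.default_rng(0)
cube = rng.uniform(-1.0, 1.0, size=(1000, 3))  # well-spread volume
plane = cube.copy()
plane[:, 2] *= 1e-4                            # squash onto a plane

flags = (is_degenerate(cube), is_degenerate(plane))
```

Here the volumetric cloud passes the check and the squashed copy is flagged. In practice the threshold would need tuning per repository, since thin but valid objects (e.g., sheets or panels) also have small ratios.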

4. Benchmark Tasks and Evaluation Metrics

The design of modern 3D-only datasets reflects the breadth of benchmarks they serve:

Recognition and segmentation: Tasks include per-point semantic/instance segmentation (evaluated with mIoU) and cross-dataset linear SVM evaluation on frozen representations (Primitive3D (Li et al., 2022)).
3D object detection: 3D bounding box metrics (e.g., mAP at a fixed matching threshold, NDS) follow conventions from nuScenes and KITTI (VCVW-3D (Ding et al., 2023)).
Reconstruction and view synthesis: Multiview consistency (LPIPS, PSNR, IoU on held-out scenes) is the standard for model comparison (uCO3D (Liu et al., 13 Jan 2025)).
Multi-task evaluation: WHU-Synthetic (Zhou et al., 2024) is uniquely aligned for depth completion (RMSE, MAE), scene-level upsampling (Chamfer, Hausdorff Distance), place recognition (AR@K), semantic segmentation (mIoU), and odometry (ATE, ARE).
Video and spatiotemporal benchmarks: SVD (Izadimehr et al., 6 Jun 2025) provides per-frame SSIM, disparity histograms, Spatial and Temporal Information (SI, TI), and supports 3D codec rate-distortion benchmarking.
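
Several of the metrics named above have compact definitions. The following sketch gives straightforward NumPy versions of mIoU, RMSE, and a symmetric Chamfer distance. Conventions differ between benchmarks (for example, squared versus unsquared Chamfer distance, or how absent classes are handled in mIoU), so these should be read as one common variant rather than the exact protocol of any dataset listed here.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:  # skip classes absent from both
            ious.append(inter / union)
    return float(np.mean(ious))

def rmse(pred, target):
    """Root mean squared error between two arrays of equal shape."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def chamfer_distance(a, b):
    """Symmetric Chamfer distance using mean squared nearest-neighbor
    distances between point sets a (n, 3) and b (m, 3)."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

pred_labels = np.array([0, 0, 1, 1, 2])
true_labels = np.array([0, 1, 1, 1, 2])
miou = mean_iou(pred_labels, true_labels, num_classes=3)

depth_err = rmse(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0]))

cloud_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
cloud_b = cloud_a + np.array([0.0, 0.0, 1.0])
cd = chamfer_distance(cloud_a, cloud_b)
```

The brute-force Chamfer computation builds an n-by-m distance matrix, which is fine for small clouds; scene-level clouds would typically need a spatial index such as a k-d tree instead.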

5. Statistical Properties and Scaling

Comprehensive statistics are reported for major 3D-only datasets:

Dataset	Scale (Objects/Scenes)	Coverage	Annotation Types
Primitive3D	150k objects	Synthetic CSG	Point cloud, part/instance
uCO3D	170k videos, 1,070 categories	Full 360° real	Camera pose, depth, 3DGS, masks
UniG3D	550k models	Open taxonomy	Mesh, image, point cloud, caption
WHU-Synthetic	>140k frames, 11 towns	Synthetic city	Multisensor, density, mesh, segmentation
VCVW-3D	375k stereo frames	Construction	3D box, depth, mask, camera
SVD	300+ video sequences	Consumer video	Stereo, disparity, SI/TI, metadata

All counts and modalities are as specified in the source datasets. Critical distinctions include simulated vs. real, object-centric vs. scene-centric, and single-task vs. multi-task alignment.

6. Dataset-Specific Methodological Innovations

Several methodological advances are uniquely enabled or introduced by 3D-only datasets.

Dataset distillation: Primitive3D (Li et al., 2022) leverages Maximum Mean Discrepancy (MMD) minimization to select training subsets matching a given target domain, reducing pretraining time by ≈86% at ≤1% loss in accuracy.
Unified transformation pipelines: UniG3D (Sun et al., 2023) demonstrates generic conversion from any raw mesh collection to a four-modality dataset (mesh, image, point cloud, text) using Blender and BLIP/CLIP-based filtering, generalizable to other repositories.
Co-located, multi-density LiDAR: WHU-Synthetic (Zhou et al., 2024) provides perfectly aligned 16/32/64/128-channel LiDAR per frame, enabling dense upsampling/segmentation training without ad-hoc spatial alignment artifacts.
Full 360° scene capture and 3DGS supervision: uCO3D (Liu et al., 13 Jan 2025) delivers dense geometric and photometric ground truth for few-view and text-to-3D pipelines, validated with photometric and semantic metrics.
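
To make the MMD-based subset selection idea concrete, the sketch below computes a biased estimate of squared Maximum Mean Discrepancy with an RBF kernel and picks the candidate subset whose features are closest to a target set. It is a generic illustration under stated assumptions: the kernel choice, the `gamma` bandwidth, the random feature vectors, and the pick-the-minimum selection rule are all introduced here and do not reproduce the Primitive3D procedure.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """RBF kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * d2)

def mmd2(x, y, gamma=0.1):
    """Biased estimate of squared MMD between samples x and y."""
    return float(rbf_kernel(x, x, gamma).mean()
                 + rbf_kernel(y, y, gamma).mean()
                 - 2.0 * rbf_kernel(x, y, gamma).mean())

def select_closest_subset(candidates, target, gamma=0.1):
    """Return the index of the candidate with the smallest MMD^2."""
    scores = [mmd2(c, target, gamma) for c in candidates]
    return int(np.argmin(scores)), scores

# Synthetic feature vectors standing in for learned embeddings.
rng = np.random.default_rng(0)
target = rng.normal(0.0, 1.0, size=(200, 8))
near = rng.normal(0.1, 1.0, size=(200, 8))  # close to the target domain
far = rng.normal(2.0, 1.0, size=(200, 8))   # shifted away from it
best, scores = select_closest_subset([far, near], target)
```

In this toy setup the subset drawn near the target distribution is selected. A practical pipeline would compute features with a pretrained encoder and would likely search over subsets incrementally rather than comparing a fixed list of candidates.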

7. Impact, Use Cases, and Limitations

3D-only datasets have accelerated progress in foundational 3D vision, generation, and robotics:

Pretraining: Datasets such as Primitive3D (Li et al., 2022) and UniG3D (Sun et al., 2023) demonstrate superior downstream performance, particularly in low-label regimes, due to dense annotation and data diversity.
Transfer learning: Synthetic-real domain adaptation remains challenging (e.g., VCVW-3D (Ding et al., 2023)); augmenting synthetic pretraining with limited real imagery bridges some domain gaps, but differences in photorealism, sensor noise, and modal coverage persist.
Multi-task learning: WHU-Synthetic (Zhou et al., 2024) enables, for the first time, co-aligned benchmarking of depth, segmentation, reconstruction, and recognition tasks, all with rigorous scene-level alignment.
3D video and telepresence benchmarks: SVD (Izadimehr et al., 6 Jun 2025) provides the first open-access, consumer-grade spatial video testbed with dense spatial ground truth, facilitating research in stereo encoding, 3D streaming, and neural rendering.

However, limitations persist: the category scope of synthetic datasets may not generalize to real-world data; missing sensor modalities (e.g., LiDAR in VCVW-3D (Ding et al., 2023)) reduce transferability; and synthetic-real gaps must be mitigated by hybrid curation strategies.

The emergence of large, multi-modal 3D-only datasets with dense annotations, unified pipelines, and task alignment is enabling advances in both discriminative and generative 3D deep learning, benchmarking, and spatial AI research platforms (Li et al., 2022, Ding et al., 2023, Liu et al., 13 Jan 2025, Zhou et al., 2024, Sun et al., 2023, Izadimehr et al., 6 Jun 2025).