MVPNet Dataset: Real-World 3D Object Reconstructions

Updated 2 March 2026

MVPNet is a real-world 3D point cloud dataset featuring dense, colored reconstructions and rich geometric complexity from everyday objects.
It employs a multi-stage reconstruction pipeline—combining SfM, MVS, back-projection, and point cloud fusion—to deliver high-fidelity geometric data.
MVPNet supports robust 3D object classification and transfer learning, with benchmark results demonstrating significant accuracy improvements over traditional methods.

MVPNet is a large-scale, real-world 3D point cloud dataset designed to advance object-centric 3D understanding within the computer vision community. Derived from MVImgNet—an extensive multi-view image dataset—MVPNet provides dense, colored 3D reconstructions for a wide variety of everyday objects, offering rich class diversity and real-scanned geometric complexity. It is specifically constructed to address limitations of synthetic benchmarks by supplying high-fidelity object point clouds captured under practical, unconstrained conditions (Yu et al., 2023).

1. Construction Pipeline

The MVPNet dataset is generated via a systematic multi-stage reconstruction pipeline based on the MVImgNet video corpus. Each input video comprises a sequence of RGB frames accompanied by camera intrinsics ( $K$ ), extrinsics ( $R$ , $t$ ), and foreground segmentation masks ( $M$ ). The reconstruction procedure involves:

Sparse Structure-from-Motion (SfM): COLMAP SfM is applied to selected frames, yielding estimated camera poses $\{R_i, t_i\}$ and intrinsics $K_i$ for every view $i$ .
Dense Multi-View Stereo (MVS): COLMAP’s PatchMatch MVS algorithm produces dense per-pixel depth maps $D_i(u,v)$ and surface normals $N_i(u,v)$ for each observed view.
Segmentation and Back-projection: Binary masks $M_i(u,v)$ restrict reconstruction to object pixels. Each masked pixel with a valid depth is back-projected into world coordinates as

$P_i(u,v) = R_i^\top [K_i^{-1}(u,v,1)^\top \cdot D_i(u,v)] - R_i^\top t_i$

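where $(u,v)$ is the 2D image coordinate. The expression inverts the world-to-camera mapping $x = R_i X + t_i$ for the camera-frame point $x = D_i(u,v)\,K_i^{-1}(u,v,1)^\top$ .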

Point Cloud Fusion: Aggregate the back-projected points $P_i(u,v)$ across all frames, with optional view-angle weighting $w_i(u,v) = \max(0,N_i \cdot v_i)$ , where $v_i$ denotes the camera’s viewing direction.
Manual Cleaning: Outlier pruning eliminates reconstructions with excessive noise or insufficient points; residual background is removed manually.
Final Output: One dense, colored point cloud (including surface normals and RGB values) is generated for each source video.

This approach ensures high-fidelity geometric capture directly from actual image data, providing object-centric point clouds with realistic variation in appearance and geometry.
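The back-projection and fusion steps can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names `backproject` and `fuse_views` are hypothetical, the world-to-camera convention $x = R X + t$ is assumed from the formula above, and fusion is reduced to plain concatenation without the optional view-angle weighting.

```python
import numpy as np

def backproject(depth, mask, K, R, t):
    """Back-project the masked pixels of one view into world coordinates.

    Assumes x_cam = R @ X_world + t, so X_world = R^T x_cam - R^T t.
    depth and mask are H x W arrays; K and R are 3 x 3; t has length 3.
    """
    # Rows are v (image y), columns are u (image x).
    v, u = np.nonzero(mask.astype(bool) & (depth > 0))
    d = depth[v, u]
    pix = np.stack([u, v, np.ones_like(u)], axis=0).astype(np.float64)  # 3 x M
    cam = (np.linalg.inv(K) @ pix) * d            # camera-frame points, 3 x M
    world = R.T @ cam - (R.T @ t)[:, None]        # world-frame points, 3 x M
    return world.T                                # M x 3

def fuse_views(point_sets):
    """Naive fusion: stack the per-view point sets into one cloud."""
    return np.concatenate(point_sets, axis=0)
```

A quick sanity check is to re-project a returned point with `K @ (R @ X + t)` and confirm that it lands on the original pixel at the original depth.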

2. Dataset Statistics

MVPNet comprises $87,200$ object-centric point cloud samples across $150$ real-world human-centric object classes. Each category holds approximately $581$ point clouds on average (range: $100$–$1500$ per class).

The benchmark split covers $80,000$ samples, fewer than the $87,200$ total, and the percentages below refer to that $80,000$:

Training: $64,000$ point clouds ( $\sim 80\%$ )
Testing: $16,000$ point clouds ( $\sim 20\%$ )
Users may optionally carve out a validation set from the training samples.

Each point cloud features hundreds of thousands of points, color attributes, and geometric normals, capturing the diversity inherent to unconstrained multi-view capture (Yu et al., 2023).

3. Annotation Format and Directory Structure

All MVPNet samples are distributed in the PLY format (ASCII or binary), each file representing a unique object instance. The data organization and file annotation are as follows:

Point Attributes:
- $x, y, z$ : 3D coordinates ( $\text{float32}$ )
- $r, g, b$ : Vertex colors ( $\text{uint8}$ ), directly inherited from source images
- $n_x, n_y, n_z$ : Surface normals ( $\text{float32}$ ) estimated by MVS
Class label:
- Each point cloud is assigned an integer class ID in $[0,149]$ , stored in the PLY header or accompanying metadata.
Directory Structure:

MVPNet/
  ├── train/
  │   ├── class_000/
  │   │   ├── obj0001.ply
  │   │   ├── obj0002.ply
  │   │   └── ...
  │   └── class_001/
  └── test/ (similarly structured)

This structured arrangement facilitates efficient parsing by machine learning libraries and supports scalable benchmarking.
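A loader for this layout might look like the sketch below. It rests on assumptions that the source does not confirm: that the class ID is the integer suffix of each `class_XXX` folder, and that vertex properties carry the names found in the file header (the test data used here names them `x`, `y`, `z`, `red`, `green`, `blue`, `nx`, `ny`, `nz`). The helpers `index_split` and `read_ascii_ply` are hypothetical, and the reader handles ASCII files only; binary PLY needs a dedicated library.

```python
import re
from pathlib import Path

import numpy as np

def index_split(root, split):
    """Return sorted (ply_path, class_id) pairs for one split.

    Assumption: the class ID is the integer suffix of the folder name.
    """
    samples = []
    for class_dir in sorted(Path(root, split).glob("class_*")):
        match = re.fullmatch(r"class_(\d+)", class_dir.name)
        if match is None or not class_dir.is_dir():
            continue
        class_id = int(match.group(1))
        samples.extend((p, class_id) for p in sorted(class_dir.glob("*.ply")))
    return samples

def read_ascii_ply(path):
    """Read the vertex element of an ASCII PLY file into named columns."""
    names, count, in_vertex = [], 0, False
    with open(path, "r") as f:
        for line in f:
            tokens = line.split()
            if not tokens:
                continue
            if tokens[0] == "format" and tokens[1] != "ascii":
                raise ValueError("this sketch only handles ASCII PLY")
            if tokens[0] == "element":
                in_vertex = tokens[1] == "vertex"
                if in_vertex:
                    count = int(tokens[2])
            elif tokens[0] == "property" and in_vertex:
                names.append(tokens[-1])   # property name is the last token
            elif tokens[0] == "end_header":
                break
        # Vertex rows follow the header directly.
        rows = [next(f).split() for _ in range(count)]
    data = np.array(rows, dtype=np.float64).reshape(count, len(names))
    return {name: data[:, i] for i, name in enumerate(names)}
```

All columns are returned as `float64` for simplicity; a production loader would preserve the `uint8` color type declared in the header.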

4. Preprocessing and Normalization Procedures

To standardize MVPNet point clouds for algorithmic consumption, the following preprocessing sequence is recommended:

Centering: Subtract the point centroid to align each object at the coordinate origin:

$P' = P - \bar{P}\,, \quad \bar{P} = \frac{1}{N}\sum_{j=1}^N P_j$

Scaling: Normalize all points to reside within the unit sphere:

$s = \max_j \|P'_j\|_2,\quad P'' = P'/s$

Downsampling: Optionally sub-select or voxel-grid filter points to a fixed budget (e.g., $1,024$ or $2,048$ points per cloud).
Data Augmentation: Random rotation, additive Gaussian noise (jitter, $\sigma \approx 0.01$ ), and point dropout are suggested during training.

Centering and scaling remove dependence on object position and size, augmentation mitigates overfitting, and a fixed point budget keeps the data compatible with existing 3D deep learning pipelines (Yu et al., 2023).
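The preprocessing sequence above can be sketched in NumPy as follows. The function names are hypothetical, and several details are assumptions not fixed by the text: downsampling is done by uniform random selection rather than voxel-grid filtering, rotation is restricted to the z axis, and the dropout ratio of $0.1$ is an illustrative value.

```python
import numpy as np

def normalize_unit_sphere(points):
    """Center on the centroid, then scale so the farthest point has norm 1."""
    centered = points - points.mean(axis=0)
    scale = np.linalg.norm(centered, axis=1).max()
    return centered / max(scale, 1e-12)   # guard against a degenerate cloud

def subsample(points, n, rng):
    """Pick n points uniformly at random (with replacement only if needed)."""
    idx = rng.choice(len(points), size=n, replace=len(points) < n)
    return points[idx]

def augment(points, rng, sigma=0.01, dropout=0.1):
    """Random z-axis rotation, Gaussian jitter, and point dropout."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    out = points @ rot.T + rng.normal(0.0, sigma, size=points.shape)
    # Dropped points are overwritten with the first point so the
    # array keeps a fixed size for batching.
    drop = rng.random(len(out)) < dropout
    out[drop] = out[0]
    return out
```

When colors or normals are used as input features, the same indices chosen in `subsample` should be applied to them, and normals should be rotated with the same matrix as the coordinates.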

5. Baseline Benchmarks and Protocols

MVPNet supports both in-dataset benchmarking and transfer learning evaluation:

A. 150-way Object Classification on MVPNet

Training/testing split: $64,000$/$16,000$
Metrics: Overall Accuracy (OA), Mean Class Accuracy (mAcc)
Representative results:

Method	OA (%)	mAcc (%)
PointNet	70.72	54.46
PointNet++	79.15	58.24
DGCNN	86.49	63.98
PointMLP	88.89	73.64
CurveNet	88.88	75.37
GDANet	89.54	68.41
PAConv	83.35	59.13
PCT (Transformer)	91.49	75.41
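The gap between OA and mAcc in the table reflects class imbalance: OA averages over samples, whereas mAcc averages per-class accuracies so that each class counts equally. A minimal sketch of both metrics, with hypothetical function names:

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def mean_class_accuracy(y_true, y_pred):
    """Unweighted mean of per-class accuracies over classes present in y_true."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))
```

For example, with true labels `[0, 0, 0, 1]` and predictions `[0, 0, 0, 0]`, OA is $0.75$ while mAcc is $0.5$, because the single sample of class 1 is misclassified and that class contributes an accuracy of zero.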

B. Transfer to ScanObjectNN

Pretraining on MVPNet yields measurable gains on external real-world benchmarks. For instance, PointNet++ trained from scratch on the ScanObjectNN PB_T50_RS split reaches $76.50\%$ OA and $73.42\%$ mAcc, while pretraining on MVPNet raises these to $78.76\%$ OA and $76.54\%$ mAcc, gains of $2.26$ and $3.12$ percentage points respectively.

This substantiates MVPNet's utility for developing transferable 3D representations and real-scan robustness (Yu et al., 2023).

6. Applications, Limitations, and Recommended Practices

Applications:

Real-world 3D object classification and retrieval
Self-supervised pretraining for partial-scan completion or understanding
Single-view 3D reconstruction (learning geometric priors)
Robotics tasks such as grasping and pose estimation

Limitations:

Objects are captured with only $180^\circ$ multi-view coverage per instance, resulting in incomplete back surfaces.
The taxonomy is human-centric (everyday objects), with under-representation of natural or biological categories.
Point density and noise levels vary, reflecting MVS artifacts and real-world imaging imperfections.
No scene-level or contextual data; all samples are strictly object-centric (Yu et al., 2023).

Recommended Practices:

Consistently apply centering and unit-sphere normalization.
Employ aggressive rotation and jitter augmentation when training from scratch.
Use a small learning rate for pretrained layers in transfer learning.
Validate both on MVPNet's internal test split and on external real-scan datasets (e.g., ScanObjectNN) to assess generalization.
Fuse MVPNet pretraining with synthetic datasets such as ShapeNet and ModelNet to increase geometric coverage.

A plausible implication is that MVPNet, as a real-data benchmark, complements synthetic CAD datasets in the development and evaluation of robust, transferable 3D recognition models.

References

Yu et al. (2023). MVImgNet: A Large-scale Dataset of Multi-view Images.
