- The paper introduces MVImgNet2.0, a new dataset doubling the scale of its predecessor with 520,000 multi-view objects across 515 categories to advance large-scale 3D vision.
- MVImgNet2.0 features significantly enhanced data quality through accurate object masks, precise camera poses using PixSfM, and dense point clouds generated with Neuralangelo.
- Experimental validation shows that MVImgNet2.0's refined annotations improve per-scene reconstruction with Instant-NGP and 3D Gaussian Splatting, and that category-agnostic models such as LGM and LRM reconstruct better when trained on MVImgNet2.0 than on synthetic data.
MVImgNet2.0: A Larger-Scale Dataset of Multi-View Images
The paper under review introduces MVImgNet2.0, a large-scale dataset of multi-view images built to address challenges in 3D vision and object reconstruction. It builds on its predecessor, MVImgNet, by doubling the dataset's scale and broadening its categorical coverage: MVImgNet2.0 comprises approximately 520,000 objects across 515 categories with high-quality annotations, aiming to give 3D vision a counterpart to 2D datasets like ImageNet in both scale and quality.
Key Features of MVImgNet2.0
MVImgNet2.0 distinguishes itself through several advancements over its predecessor:
- Increased Scale and Diversity: MVImgNet2.0 extends the dataset to about 520,000 objects, encompassing 515 categories. This scale renders it substantial enough for training large-scale 3D models, comparable to prominent 2D datasets such as ImageNet.
- Enhanced Multi-View Capture: Objects are now primarily captured along full 360-degree trajectories rather than partial arcs, providing complete coverage of each object and enabling better modeling of object geometry and texture.
- Improved Data Quality: The quality of annotations in MVImgNet2.0 has been significantly enhanced:
- Accurate Object Masks: Object segmentation is refined through a detection-segmentation-tracking pipeline built on Grounding DINO and SAM (a per-frame sketch follows this list).
- Precise Camera Poses: Camera poses are estimated with Pixel-Perfect Structure-from-Motion (PixSfM), whose featuremetric keypoint and bundle-adjustment refinement yields lower pose-estimation error than standard SfM (see the refinement sketch after this list).
- Dense Point Clouds: High-quality dense point clouds are reconstructed from the multi-view captures using Neuralangelo.
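To make the mask pipeline concrete, below is a minimal per-frame sketch of the detect-then-segment step. The checkpoint names, text prompt, thresholds, and file paths are illustrative assumptions, and the tracking stage that propagates masks across frames is omitted; this is not the paper's actual code.

```python
# Per-frame detect-then-segment sketch: Grounding DINO proposes a box for the
# object category, SAM turns the box into a pixel-accurate mask.
# Checkpoints, prompt, and paths are hypothetical stand-ins.
import numpy as np
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
from segment_anything import sam_model_registry, SamPredictor

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Open-vocabulary detection with Grounding DINO (HF port, assumed checkpoint).
processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-base")
detector = AutoModelForZeroShotObjectDetection.from_pretrained(
    "IDEA-Research/grounding-dino-base").to(device)

image = Image.open("frame_000.jpg").convert("RGB")
# Prompt would come from the object's category label; "a mug." is illustrative.
inputs = processor(images=image, text="a mug.", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = detector(**inputs)
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]])

# Assume the object is detected; keep the highest-scoring box (XYXY pixels).
res = results[0]
best = int(res["scores"].argmax())
box = res["boxes"][best].cpu().numpy()

# 2) Box-prompted segmentation with SAM.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam.to(device))
predictor.set_image(np.array(image))
masks, _, _ = predictor.predict(box=box, multimask_output=False)
object_mask = masks[0]  # boolean HxW mask for the detected object
```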
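Similarly, a rough sketch of featuremetric pose refinement using the public pixel-perfect-sfm reference implementation is shown below. The `PixSfM` wrapper and the hloc-style feature/match files reflect that repository's documented interface, not details confirmed by the paper; all paths are hypothetical.

```python
# Featuremetric SfM refinement sketch with the cvg/pixel-perfect-sfm package.
# Assumes local features and matches were already extracted with hloc;
# directory layout and file names are illustrative, not the paper's setup.
from pathlib import Path
from pixsfm.refine_hloc import PixSfM

scene = Path("mvimgnet_scene")          # hypothetical per-object capture directory
refiner = PixSfM()                      # default featuremetric keypoint + BA refinement
model, debug_outputs = refiner.reconstruction(
    scene / "sfm_output",               # where the refined COLMAP model is written
    scene / "images",                   # the multi-view frames of one object
    scene / "pairs.txt",                # image pairs to match (e.g., sequential pairs)
    scene / "features.h5",              # local features extracted with hloc
    scene / "matches.h5",               # feature matches extracted with hloc
)
print(f"Refined poses for {len(model.images)} images")
```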
Experimental Validation
The paper provides extensive experimental validation demonstrating the utility of MVImgNet2.0 in the field of 3D reconstruction:
- Per-Scene 3D Reconstruction: Methods such as Instant-NGP and 3D Gaussian Splatting render scenes with higher visual fidelity when supplied with MVImgNet2.0's more precise camera pose annotations (a pose-loading sketch follows this list).
- Category-Agnostic Reconstruction: The paper evaluates state-of-the-art feed-forward reconstruction models such as LGM and LRM. Training on MVImgNet2.0 yields better reconstruction quality than training on synthetic datasets, underscoring the value of real-world variability, and the dataset's higher-quality annotations further improve model performance.
- Utility of 360-Degree Views: The inclusion of 360-degree object captures contributes to more complete training data, facilitating better object shape understanding and reconstruction.
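As context for the per-scene experiments, the sketch below reads per-frame camera poses from COLMAP's standard `images.txt` text format into the camera-to-world matrices that Instant-NGP- and 3DGS-style loaders expect. That MVImgNet2.0 ships poses in exactly this layout is an assumption based on the original MVImgNet's COLMAP-style releases.

```python
# Parse COLMAP images.txt into camera-to-world poses. Assumes the standard
# COLMAP text layout; whether MVImgNet2.0 uses exactly this file is an assumption.
import numpy as np

def qvec_to_rotmat(q):
    """Convert a COLMAP quaternion (qw, qx, qy, qz) to a 3x3 rotation matrix."""
    w, x, y, z = q
    return np.array([
        [1 - 2*y*y - 2*z*z, 2*x*y - 2*z*w,     2*x*z + 2*y*w],
        [2*x*y + 2*z*w,     1 - 2*x*x - 2*z*z, 2*y*z - 2*x*w],
        [2*x*z - 2*y*w,     2*y*z + 2*x*w,     1 - 2*x*x - 2*y*y],
    ])

def load_poses(images_txt):
    """Return {image_name: 4x4 camera-to-world matrix} from COLMAP images.txt."""
    poses = {}
    with open(images_txt) as f:
        lines = [l.strip() for l in f if l.strip() and not l.startswith("#")]
    # Each image occupies two lines; the second (2D point list) is skipped here.
    for header in lines[::2]:
        elems = header.split()
        qvec = np.array(list(map(float, elems[1:5])))   # world-to-camera rotation
        tvec = np.array(list(map(float, elems[5:8])))   # world-to-camera translation
        R = qvec_to_rotmat(qvec)
        c2w = np.eye(4)
        c2w[:3, :3] = R.T                # invert: camera-to-world rotation
        c2w[:3, 3] = -R.T @ tvec         # camera center in world coordinates
        poses[elems[9]] = c2w            # elems[9] is the image file name
    return poses

poses = load_poses("sparse/0/images.txt")  # hypothetical scene directory
print(f"Loaded {len(poses)} camera poses")
```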
Implications and Future Directions
The development of MVImgNet2.0 carries notable implications for computer vision, particularly for training and improving 3D vision models. The dataset's scale and quality may usher in advances in multi-view modeling, shape reconstruction, and novel view synthesis. Future work could extend it to more complex objects, dynamic scenes, and even higher annotation fidelity, pushing the boundaries of automated reconstruction and understanding in real-world contexts. The dataset also holds promise for broader applications, including robotic perception, augmented reality, and autonomous systems.
Overall, MVImgNet2.0 represents a valuable resource that equips researchers and practitioners with a robust foundation for 3D-related tasks, reinforcing the advancement of modern visual computing technologies.