- The paper introduces MVImgNet2.0, a new dataset doubling the scale of its predecessor with 520,000 multi-view objects across 515 categories to advance large-scale 3D vision.
- MVImgNet2.0 features significantly enhanced data quality through accurate object masks, precise camera poses using PixSfM, and dense point clouds generated with Neuralangelo.
- Experimental validation shows that MVImgNet2.0's refined annotations improve per-scene reconstruction with Instant-NGP and 3D Gaussian Splatting, and that category-agnostic models such as LGM and LRM reconstruct better when trained on MVImgNet2.0 than on synthetic data.
MVImgNet2.0: A Larger-Scale Dataset of Multi-View Images
The paper under review introduces MVImgNet2.0, a large-scale dataset of multi-view images built to address challenges in 3D vision and object reconstruction. It builds on its predecessor, MVImgNet, by doubling the dataset's scale and broadening its categorical coverage: MVImgNet2.0 comprises approximately 520,000 objects across 515 categories with high-quality annotations, aiming to give 3D vision a counterpart to 2D datasets like ImageNet in both scale and quality.
Key Features of MVImgNet2.0
MVImgNet2.0 distinguishes itself through several advancements over its predecessor:
- Increased Scale and Diversity: MVImgNet2.0 extends the dataset to about 520,000 objects, encompassing 515 categories. This scale renders it substantial enough for training large-scale 3D models, comparable to prominent 2D datasets such as ImageNet.
- Enhanced Multi-View Capture: Objects are now primarily captured along full 360-degree trajectories rather than partial arcs, providing complete coverage of each object and enabling better modeling of object geometry and texture.
- Improved Data Quality: The quality of annotations in MVImgNet2.0 has been significantly enhanced:
- Accurate Object Masks: Object segmentation is refined through a detection-segmentation-tracking pipeline built on Grounding DINO and SAM (a per-frame sketch follows this list).
- Precise Camera Poses: Camera poses are estimated with Pixel-Perfect Structure-from-Motion (PixSfM), whose featuremetric keypoint and bundle-adjustment refinement yields lower pose-estimation error than standard SfM (see the refinement sketch after this list).
- Dense Point Clouds: High-quality dense point clouds are reconstructed from the multi-view captures using Neuralangelo.
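To make the mask pipeline concrete, below is a minimal per-frame sketch of the detect-then-segment step. The checkpoint names, text prompt, thresholds, and file paths are illustrative assumptions, and the tracking stage that propagates masks across frames is omitted; this is not the paper's actual code.

```python
# Per-frame detect-then-segment sketch: Grounding DINO proposes a box for the
# object category, SAM turns the box into a pixel-accurate mask.
# Checkpoints, prompt, and paths are hypothetical stand-ins.
import numpy as np
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
from segment_anything import sam_model_registry, SamPredictor

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Open-vocabulary detection with Grounding DINO (HF port, assumed checkpoint).
processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-base")
detector = AutoModelForZeroShotObjectDetection.from_pretrained(
    "IDEA-Research/grounding-dino-base").to(device)

image = Image.open("frame_000.jpg").convert("RGB")
# Prompt would come from the object's category label; "a mug." is illustrative.
inputs = processor(images=image, text="a mug.", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = detector(**inputs)
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]])

# Assume the object is detected; keep the highest-scoring box (XYXY pixels).
res = results[0]
best = int(res["scores"].argmax())
box = res["boxes"][best].cpu().numpy()

# 2) Box-prompted segmentation with SAM.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam.to(device))
predictor.set_image(np.array(image))
masks, _, _ = predictor.predict(box=box, multimask_output=False)
object_mask = masks[0]  # boolean HxW mask for the detected object
```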
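Similarly, a rough sketch of featuremetric pose refinement using the public pixel-perfect-sfm reference implementation is shown below. The `PixSfM` wrapper and the hloc-style feature/match files reflect that repository's documented interface, not details confirmed by the paper; all paths are hypothetical.

```python
# Featuremetric SfM refinement sketch with the cvg/pixel-perfect-sfm package.
# Assumes local features and matches were already extracted with hloc;
# directory layout and file names are illustrative, not the paper's setup.
from pathlib import Path
from pixsfm.refine_hloc import PixSfM

scene = Path("mvimgnet_scene")          # hypothetical per-object capture directory
refiner = PixSfM()                      # default featuremetric keypoint + BA refinement
model, debug_outputs = refiner.reconstruction(
    scene / "sfm_output",               # where the refined COLMAP model is written
    scene / "images",                   # the multi-view frames of one object
    scene / "pairs.txt",                # image pairs to match (e.g., sequential pairs)
    scene / "features.h5",              # local features extracted with hloc
    scene / "matches.h5",               # feature matches extracted with hloc
)
print(f"Refined poses for {len(model.images)} images")
```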
Experimental Validation
The paper provides extensive experimental validation demonstrating the utility of MVImgNet2.0 in the field of 3D reconstruction:
- Per-Scene 3D Reconstruction: Methods such as Instant-NGP and 3D Gaussian Splatting render scenes with higher visual fidelity when supplied with MVImgNet2.0's more precise camera pose annotations (a pose-loading sketch follows this list).
- Category-Agnostic Reconstruction: The paper evaluates state-of-the-art feed-forward reconstruction models such as LGM and LRM. Training on MVImgNet2.0 yields better reconstruction quality than training on synthetic datasets, underscoring the value of real-world variability, and the dataset's higher-quality annotations further improve model performance.
- Utility of 360-Degree Views: The inclusion of 360-degree object captures contributes to more complete training data, facilitating better object shape understanding and reconstruction.
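As context for the per-scene experiments, the sketch below reads per-frame camera poses from COLMAP's standard `images.txt` text format into the camera-to-world matrices that Instant-NGP- and 3DGS-style loaders expect. That MVImgNet2.0 ships poses in exactly this layout is an assumption based on the original MVImgNet's COLMAP-style releases.

```python
# Parse COLMAP images.txt into camera-to-world poses. Assumes the standard
# COLMAP text layout; whether MVImgNet2.0 uses exactly this file is an assumption.
import numpy as np

def qvec_to_rotmat(q):
    """Convert a COLMAP quaternion (qw, qx, qy, qz) to a 3x3 rotation matrix."""
    w, x, y, z = q
    return np.array([
        [1 - 2*y*y - 2*z*z, 2*x*y - 2*z*w,     2*x*z + 2*y*w],
        [2*x*y + 2*z*w,     1 - 2*x*x - 2*z*z, 2*y*z - 2*x*w],
        [2*x*z - 2*y*w,     2*y*z + 2*x*w,     1 - 2*x*x - 2*y*y],
    ])

def load_poses(images_txt):
    """Return {image_name: 4x4 camera-to-world matrix} from COLMAP images.txt."""
    poses = {}
    with open(images_txt) as f:
        lines = [l.strip() for l in f if l.strip() and not l.startswith("#")]
    # Each image occupies two lines; the second (2D point list) is skipped here.
    for header in lines[::2]:
        elems = header.split()
        qvec = np.array(list(map(float, elems[1:5])))   # world-to-camera rotation
        tvec = np.array(list(map(float, elems[5:8])))   # world-to-camera translation
        R = qvec_to_rotmat(qvec)
        c2w = np.eye(4)
        c2w[:3, :3] = R.T                # invert: camera-to-world rotation
        c2w[:3, 3] = -R.T @ tvec         # camera center in world coordinates
        poses[elems[9]] = c2w            # elems[9] is the image file name
    return poses

poses = load_poses("sparse/0/images.txt")  # hypothetical scene directory
print(f"Loaded {len(poses)} camera poses")
```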
Implications and Future Directions
The development of MVImgNet2.0 carries notable implications for computer vision, particularly for training and improving 3D vision models. The dataset's scale and quality may usher in advances in multi-view modeling, shape reconstruction, and novel view synthesis. Future work could extend it to more complex objects, dynamic scenes, and even higher annotation fidelity, pushing the boundaries of automated reconstruction and understanding in real-world contexts. The dataset also holds promise for broader applications, including robotic perception, augmented reality, and autonomous systems.
Overall, MVImgNet2.0 represents a valuable resource that equips researchers and practitioners with a robust foundation for 3D-related tasks, reinforcing the advancement of modern visual computing technologies.