Overview of "Common Objects in 3D"
The paper "Common Objects in 3D (CO3D): Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction" presents a substantial contribution to the field of 3D object reconstruction by introducing a large-scale dataset and a novel model for 3D rendering. The authors address significant limitations in the current availability of real-world datasets and advance methodologies for learning category-centric 3D models.
Dataset Contribution
CO3D represents a major leap in dataset size and realism, comprising roughly 1.5 million multi-view frames of nearly 19,000 objects, each captured in its own video, spanning 50 MS-COCO categories. Unlike prior datasets that often rely on synthetic renderings or limited real-world captures, CO3D pairs extensive real-world imagery with annotated camera poses and dense 3D point clouds. The collection process combines crowd-sourced video capture with photogrammetry (structure-from-motion), producing high-quality annotations efficiently, so the resulting data more accurately reflects the complexity of real-life scenes. This scale enables the training and evaluation of markedly more robust 3D reconstruction models.
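To make the shape of these annotations concrete, the sketch below shows one plausible way the per-frame and per-sequence data described above could be organized. This is a hypothetical illustration, not the official CO3D API; all field and class names here are assumptions.

```python
# Hypothetical sketch of per-frame / per-sequence annotations as described
# in the paper (image, camera pose, object mask, point cloud). Field names
# are illustrative only, not the official CO3D data format.
from dataclasses import dataclass
import numpy as np

@dataclass
class FrameAnnotation:
    image: np.ndarray         # (H, W, 3) RGB frame from the object video
    rotation: np.ndarray      # (3, 3) camera rotation from photogrammetry
    translation: np.ndarray   # (3,) camera translation
    focal_length: np.ndarray  # (2,) camera intrinsics
    mask: np.ndarray          # (H, W) foreground object segmentation

@dataclass
class SequenceAnnotation:
    category: str             # one of the 50 MS-COCO categories
    frames: list              # list[FrameAnnotation] for one object video
    point_cloud: np.ndarray   # (N, 3) dense reconstruction of the object
```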
Evaluation and Novel Model
The paper uses CO3D to carry out one of the first large-scale evaluations of new-view synthesis and 3D reconstruction methods under "in-the-wild" conditions. Significantly, the authors introduce NerFormer, a neural rendering model that applies Transformer architectures to implicit neural representations. NerFormer synthesizes novel views from a sparse set of source images, using attention both to aggregate features across source views and to reason spatially along each rendering ray. The model outperforms existing baselines, including implicit and explicit methods such as Neural Radiance Fields (NeRF), Neural Volumes (NV), and more traditional mesh- and point-cloud-based techniques.
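To make the two attention patterns concrete, here is a minimal PyTorch sketch of a block that alternates attention across the source-view axis and along the ray-sample axis. This is not the authors' implementation; the tensor layout, module names, and block structure are assumptions for illustration.

```python
# Minimal sketch (not the authors' code) of the two attention patterns
# described in the paper: for each 3D sample along a ray, features gathered
# from N source views are pooled by attention over the view axis, and
# spatial reasoning uses attention over the sample axis along the ray.
import torch
import torch.nn as nn

class RayViewAttentionBlock(nn.Module):
    """One NerFormer-style block: view-axis attention, then ray-axis attention."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.view_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ray_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (rays, samples_per_ray, views, dim) -- features sampled from
        # each source view at each 3D point along each ray.
        R, S, V, D = x.shape
        # Attention across the view axis (aggregates the source views).
        v = x.reshape(R * S, V, D)
        q = self.norm1(v)
        v = v + self.view_attn(q, q, q)[0]
        x = v.reshape(R, S, V, D)
        # Attention along the ray axis (spatial reasoning over samples).
        r = x.permute(0, 2, 1, 3).reshape(R * V, S, D)
        q = self.norm2(r)
        r = r + self.ray_attn(q, q, q)[0]
        return r.reshape(R, V, S, D).permute(0, 2, 1, 3)

# Usage: 8 rays, 32 samples per ray, 5 source views, 64-dim features.
block = RayViewAttentionBlock(dim=64)
feats = torch.randn(8, 32, 5, 64)
out = block(feats)  # same shape: (8, 32, 5, 64)
```

Stacking several such blocks and decoding each ray sample to density and color would complete a NerFormer-style renderer; the sketch above only isolates the attention pattern.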
Numerical Results and Methodological Insights
The paper presents strong empirical results: NerFormer outperforms 14 baseline models across several metrics, including PSNR, LPIPS, and IoU, indicating its effectiveness at reconstructing accurate and visually coherent 3D objects. Its Transformer-based design achieves a better balance of detail and computational efficiency, leveraging the strengths of implicit neural representations while mitigating their weaknesses in handling noisy inputs.
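For reference, PSNR, the primary image-quality metric cited above, follows a standard definition; the short sketch below shows it (this is generic metric code, not from the paper, and assumes images are float tensors in [0, 1]). LPIPS, by contrast, is a learned perceptual similarity computed with a pretrained network.

```python
# Standard peak signal-to-noise ratio: 10 * log10(MAX^2 / MSE).
# Generic metric code, not taken from the paper.
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```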
Implications and Future Directions
The introduction of CO3D and its accompanying methodology sets a new benchmark for real-world 3D reconstruction. The dataset's scale and diversity pave the way for more generalizable and robust models that operate effectively in varied and complex environments. The work also suggests pathways for future research, including more scalable annotation pipelines, improved generalization to unseen categories, and the integration of further advances in neural rendering.
In conclusion, the paper contributes on both the data and the modeling fronts, and is likely to shape future research trajectories in 3D category reconstruction and rendering.