- The paper presents a semi-automatic annotation process that scales CAD-model alignment to RGB videos through human-in-the-loop refinement.
- The resulting CAD-Estate dataset contains 101K object instances of 12K unique CAD models drawn from 20K videos, far exceeding the scale of previous datasets.
- Pre-training on CAD-Estate improves deep learning models for 3D object reconstruction and pose estimation in real-world multi-object scenes.
An Overview of CAD-Estate: A Large-Scale RGB Video Dataset for 3D Object Annotation with CAD Models
The paper presents a novel approach for annotating complex multi-object scenes in RGB videos, culminating in the CAD-Estate dataset. The central aim is to provide a globally consistent 3D representation of each scene by aligning CAD models from a database and placing them in the scene's 3D coordinate frame via a 9-DoF pose transformation (3D translation, 3D rotation, and per-axis scale).
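To make the 9-DoF parameterization concrete, the minimal sketch below (Python; the function and variable names are illustrative, not code from the paper) applies per-axis scale, rotation, and translation to the vertices of a CAD model:

```python
import numpy as np

def apply_9dof(vertices: np.ndarray, scale: np.ndarray,
               rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    """Map CAD-model vertices (N, 3) into the scene coordinate frame.

    The 9 degrees of freedom: 3 anisotropic scales, 3 rotation parameters
    (represented here as a 3x3 rotation matrix), and 3 translations.
    """
    return (rotation @ (vertices * scale).T).T + translation

# Example: scale a model, rotate 90 degrees about the z-axis, then shift it.
verts = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
placed = apply_9dof(verts, scale=np.array([2.0, 1.0, 0.5]), rotation=R,
                    translation=np.array([0.3, -0.2, 1.5]))
```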
Method and Dataset Details
A pivotal contribution is the semi-automatic annotation process, which enables large-scale data generation from readily available RGB videos without depth sensors. The workflow relies on human annotators only for simple, quick tasks, making it amenable to crowd-sourcing. This methodological choice enabled the construction of CAD-Estate from real-estate videos sourced primarily from YouTube.
The dataset comprises approximately 101,000 CAD-model instances drawn from 20,000 videos, covering 12,000 unique CAD models. This is a marked improvement in scale over existing datasets such as Scan2CAD: CAD-Estate offers 7x more object instances and 4x more unique CAD models. Such scale makes the dataset well suited for pre-training deep learning models for automatic 3D object reconstruction and pose estimation. The paper validates this by demonstrating performance improvements on the well-known Scan2CAD benchmark when models are pre-trained on CAD-Estate.
Dataset Construction
CAD-Estate's construction involves five distinct stages:
- 2D Object Detection and Tracking: An automated pipeline detects objects and associates the detections across video frames into object-specific tracks (a minimal association sketch follows this list).
- CAD Model Selection: For each track, candidate CAD models are retrieved automatically from a repository, and a human annotator selects the best match, ensuring correctness and relevance (see the retrieval sketch below).
- 3D-to-2D Correspondence Annotation: Human annotators mark correspondences between points on the CAD model and pixels in the video frames, providing the constraints needed to determine the pose.
- Pose Optimization: A non-linear optimization integrates the multi-view evidence to estimate a coherent 9-DoF transformation per object, minimizing the reprojection error of the annotated correspondences while keeping object placement physically plausible (see the least-squares sketch below).
- Verification: Human reviewers compare renderings of the posed CAD models against the video frames and discard poorly aligned annotations.
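The tracking stage can be illustrated with a minimal greedy association step (a hedged sketch, not the paper's actual pipeline): detections in consecutive frames are linked into tracks whenever their bounding boxes overlap sufficiently.

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(tracks, detections, threshold=0.5):
    """Greedily extend each track with the best-overlapping new detection."""
    unmatched = list(range(len(detections)))
    for track in tracks:
        scores = [iou(track[-1], detections[i]) for i in unmatched]
        if scores and max(scores) >= threshold:
            best = unmatched.pop(int(np.argmax(scores)))
            track.append(detections[best])
    # Detections that matched no existing track start new tracks.
    tracks.extend([detections[i]] for i in unmatched)
    return tracks
```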
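Candidate retrieval for the CAD-model-selection stage can be approximated by nearest-neighbor search over image embeddings. The snippet below is an assumption-laden sketch (the embedding source and function names are hypothetical; the paper does not prescribe this exact method); a human annotator then picks the correct model from the returned candidates.

```python
import numpy as np

def retrieve_candidates(track_embedding: np.ndarray,
                        cad_embeddings: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k CAD models whose rendered-view embeddings
    are closest (by cosine similarity) to the embedding of a track crop."""
    q = track_embedding / np.linalg.norm(track_embedding)
    db = cad_embeddings / np.linalg.norm(cad_embeddings, axis=1, keepdims=True)
    sims = db @ q
    return np.argsort(-sims)[:k]  # shortlist shown to the human annotator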
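The pose-optimization stage can be viewed as non-linear least squares over the 9 pose parameters. The sketch below is a simplified illustration with hypothetical names (the paper's actual objective also includes placement priors, omitted here): it minimizes the reprojection error of the annotated 3D-to-2D correspondences, assuming camera intrinsics and per-frame camera poses are available from structure-from-motion.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reprojection_residuals(params, points_3d, points_2d, K, cam_poses):
    """Residuals of annotated correspondences under a 9-DoF object pose.

    params    : (9,) = 3 log-scales, 3 axis-angle rotation, 3 translation
    points_3d : (N, 3) CAD-model points picked by annotators
    points_2d : (N, 2) matching pixel locations in the video frames
    K         : (3, 3) camera intrinsics
    cam_poses : (N, 3, 4) world-to-camera [R|t] of the frame in which each
                correspondence was annotated
    """
    scale = np.exp(params[:3])                    # keep scales positive
    R_obj = Rotation.from_rotvec(params[3:6]).as_matrix()
    t_obj = params[6:9]
    world = (R_obj @ (points_3d * scale).T).T + t_obj  # object -> scene
    residuals = []
    for X, uv, P in zip(world, points_2d, cam_poses):
        cam = P[:, :3] @ X + P[:, 3]              # scene -> camera
        proj = K @ cam
        residuals.append(proj[:2] / proj[2] - uv) # pixel-space error
    return np.concatenate(residuals)

# Usage: solve from an initial guess (identity rotation, unit scale).
# result = least_squares(reprojection_residuals, x0=np.zeros(9),
#                        args=(points_3d, points_2d, K, cam_poses))
```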
Comparative Analysis and Implications
An extensive comparative analysis shows how CAD-Estate surpasses prior datasets in scale and diversity, detailing attributes such as object-class distribution, number of objects per video, and camera-framing statistics. Its emphasis on diverse multi-object scenes captured from a distance, with most objects fully visible, poses a distinct challenge that demands robust handling of complex scenes from predictive models.
The implications of CAD-Estate are substantial. It provides a rich resource for training and evaluating models for semantic 3D scene understanding, particularly in settings where high-fidelity input such as RGB-D sensing is unavailable. By opening new avenues for model pre-training, CAD-Estate unlocks potential improvements in downstream tasks across domains such as augmented reality, robotics, and autonomous driving.
Future Directions
The introduction of CAD-Estate establishes a foundation for further research on scalable video-annotation methods and the use of CAD-based priors. It is likely to stimulate advances in learning-based multi-object 3D reconstruction, improving the ability of models to interpret and reconstruct complex real-world environments from ordinary RGB input. Future work may extend the dataset with more object categories or improve the automated annotation stages, further refining the balance between human intervention and computational inference.