- The paper presents a semi-automatic annotation process that scales CAD-model alignment to RGB videos through human-in-the-loop refinement.
- The resulting CAD-Estate dataset contains 101K object instances of 12K unique CAD models drawn from 20K videos, far exceeding the scale of previous datasets.
- Pre-training on CAD-Estate improves deep learning models for 3D object reconstruction and pose estimation in real-world multi-object scenes.
An Overview of CAD-Estate: A Large-Scale RGB Video Dataset for 3D Object Annotation with CAD Models
The paper presents a novel approach for annotating complex multi-object scenes in RGB videos, culminating in the CAD-Estate dataset. The central aim is to provide a globally consistent 3D representation of each scene by aligning CAD models from a database and placing them in the scene's 3D coordinate frame via a 9-DoF pose transformation (3D translation, 3D rotation, and per-axis scale).
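To make the 9-DoF parameterization concrete, the minimal sketch below (Python; the function and variable names are illustrative, not code from the paper) applies per-axis scale, rotation, and translation to the vertices of a CAD model:

```python
import numpy as np

def apply_9dof(vertices: np.ndarray, scale: np.ndarray,
               rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    """Map CAD-model vertices (N, 3) into the scene coordinate frame.

    The 9 degrees of freedom: 3 anisotropic scales, 3 rotation parameters
    (represented here as a 3x3 rotation matrix), and 3 translations.
    """
    return (rotation @ (vertices * scale).T).T + translation

# Example: scale a model, rotate 90 degrees about the z-axis, then shift it.
verts = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
placed = apply_9dof(verts, scale=np.array([2.0, 1.0, 0.5]), rotation=R,
                    translation=np.array([0.3, -0.2, 1.5]))
```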
Method and Dataset Details
A pivotal contribution is the semi-automatic annotation process, which enables large-scale data generation from readily available RGB videos without depth sensors. The workflow relies on human annotators only for simple, quick tasks, making it amenable to crowd-sourcing. This methodological choice enabled the construction of CAD-Estate from real-estate videos sourced primarily from YouTube.
The dataset comprises approximately 101,000 CAD-model instances drawn from 20,000 videos, covering 12,000 unique CAD models. This is a marked improvement in scale over existing datasets such as Scan2CAD: CAD-Estate offers 7x more object instances and 4x more unique CAD models. Such scale makes the dataset well suited for pre-training deep learning models for automatic 3D object reconstruction and pose estimation. The paper validates this by demonstrating performance improvements on the well-known Scan2CAD benchmark when models are pre-trained on CAD-Estate.
Dataset Construction
CAD-Estate's construction involves five distinct stages:
- 2D Object Detection and Tracking: An automated pipeline detects objects and associates the detections across video frames into object-specific tracks (a minimal association sketch follows this list).
- CAD Model Selection: For each track, candidate CAD models are retrieved automatically from a repository, and a human annotator selects the best match, ensuring correctness and relevance (see the retrieval sketch below).
- 3D-to-2D Correspondence Annotation: Human annotators mark correspondences between points on the CAD model and pixels in the video frames, providing the constraints needed to determine the pose.
- Pose Optimization: A non-linear optimization integrates the multi-view evidence to estimate a coherent 9-DoF transformation per object, minimizing the reprojection error of the annotated correspondences while keeping object placement physically plausible (see the least-squares sketch below).
- Verification: Human reviewers compare renderings of the posed CAD models against the video frames and discard poorly aligned annotations.
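The tracking stage can be illustrated with a minimal greedy association step (a hedged sketch, not the paper's actual pipeline): detections in consecutive frames are linked into tracks whenever their bounding boxes overlap sufficiently.

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(tracks, detections, threshold=0.5):
    """Greedily extend each track with the best-overlapping new detection."""
    unmatched = list(range(len(detections)))
    for track in tracks:
        scores = [iou(track[-1], detections[i]) for i in unmatched]
        if scores and max(scores) >= threshold:
            best = unmatched.pop(int(np.argmax(scores)))
            track.append(detections[best])
    # Detections that matched no existing track start new tracks.
    tracks.extend([detections[i]] for i in unmatched)
    return tracks
```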
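Candidate retrieval for the CAD-model-selection stage can be approximated by nearest-neighbor search over image embeddings. The snippet below is an assumption-laden sketch (the embedding source and function names are hypothetical; the paper does not prescribe this exact method); a human annotator then picks the correct model from the returned candidates.

```python
import numpy as np

def retrieve_candidates(track_embedding: np.ndarray,
                        cad_embeddings: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k CAD models whose rendered-view embeddings
    are closest (by cosine similarity) to the embedding of a track crop."""
    q = track_embedding / np.linalg.norm(track_embedding)
    db = cad_embeddings / np.linalg.norm(cad_embeddings, axis=1, keepdims=True)
    sims = db @ q
    return np.argsort(-sims)[:k]  # shortlist shown to the human annotator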
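The pose-optimization stage can be viewed as non-linear least squares over the 9 pose parameters. The sketch below is a simplified illustration with hypothetical names (the paper's actual objective also includes placement priors, omitted here): it minimizes the reprojection error of the annotated 3D-to-2D correspondences, assuming camera intrinsics and per-frame camera poses are available from structure-from-motion.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reprojection_residuals(params, points_3d, points_2d, K, cam_poses):
    """Residuals of annotated correspondences under a 9-DoF object pose.

    params    : (9,) = 3 log-scales, 3 axis-angle rotation, 3 translation
    points_3d : (N, 3) CAD-model points picked by annotators
    points_2d : (N, 2) matching pixel locations in the video frames
    K         : (3, 3) camera intrinsics
    cam_poses : (N, 3, 4) world-to-camera [R|t] of the frame in which each
                correspondence was annotated
    """
    scale = np.exp(params[:3])                    # keep scales positive
    R_obj = Rotation.from_rotvec(params[3:6]).as_matrix()
    t_obj = params[6:9]
    world = (R_obj @ (points_3d * scale).T).T + t_obj  # object -> scene
    residuals = []
    for X, uv, P in zip(world, points_2d, cam_poses):
        cam = P[:, :3] @ X + P[:, 3]              # scene -> camera
        proj = K @ cam
        residuals.append(proj[:2] / proj[2] - uv) # pixel-space error
    return np.concatenate(residuals)

# Usage: solve from an initial guess (identity rotation, unit scale).
# result = least_squares(reprojection_residuals, x0=np.zeros(9),
#                        args=(points_3d, points_2d, K, cam_poses))
```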
Comparative Analysis and Implications
An extensive comparative analysis shows how CAD-Estate surpasses prior datasets in scale and diversity, detailing attributes such as object-class distribution, number of objects per video, and camera-framing statistics. Its emphasis on diverse multi-object scenes captured from a distance, with most objects fully visible, poses a distinct challenge that demands robust handling of complex scenes from predictive models.
The implications of CAD-Estate are substantial. It provides a rich resource for training and evaluating models for semantic 3D scene understanding, particularly in settings where high-fidelity input such as RGB-D sensing is unavailable. By opening new avenues for model pre-training, CAD-Estate unlocks potential improvements in downstream tasks across domains such as augmented reality, robotics, and autonomous driving.
Future Directions
The introduction of CAD-Estate establishes a foundation for further research on scalable video-annotation methods and the use of CAD-based priors. It is likely to stimulate advances in learning-based multi-object 3D reconstruction, improving the ability of models to interpret and reconstruct complex real-world environments from ordinary RGB input. Future work may extend the dataset with more object categories or improve the automated annotation stages, further refining the balance between human intervention and computational inference.