WALT3D: Generating Realistic Training Data from Time-Lapse Imagery for Reconstructing Dynamic Objects under Occlusion (2403.19022v2)
Abstract: Current methods for 2D and 3D object understanding struggle with severe occlusions in busy urban environments, partly due to the lack of large-scale labeled ground-truth annotations for learning occlusion. In this work, we introduce a novel framework for automatically generating a large, realistic dataset of dynamic objects under occlusion using freely available time-lapse imagery. By leveraging off-the-shelf 2D (bounding box, segmentation, keypoint) and 3D (pose, shape) predictions as pseudo-groundtruth, unoccluded 3D objects are identified automatically and composited into the background in a clip-art style, ensuring realistic appearances and physically accurate occlusion configurations. The resulting clip-art images, together with their pseudo-groundtruth, enable efficient training of object reconstruction methods that are robust to occlusions. Our method demonstrates significant improvements in both 2D and 3D reconstruction, particularly in scenarios with heavily occluded objects such as vehicles and people in urban scenes.
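The clip-art compositing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`composite_clipart`, `occlusion_ratio`), the dictionary keys, and the use of a single scalar depth per object for ordering are all assumptions made for illustration. The sketch pastes cutouts of unoccluded objects onto a background from far to near, so that each object's full (amodal) mask is known by construction and its visible mask follows from whatever nearer objects cover it.

```python
import numpy as np


def composite_clipart(background, objects):
    """Paste unoccluded object cutouts onto a background, far to near.

    background: (H, W, 3) uint8 image.
    objects: list of dicts with hypothetical keys
        'rgb'   : (H, W, 3) uint8 frame in which the object is unoccluded,
        'mask'  : (H, W) bool full (amodal) mask of the object,
        'depth' : float distance from the camera, used only for ordering.
    Returns the composite image and one visible mask per input object.
    """
    image = background.copy()
    # Farthest object first, so nearer objects overwrite it.
    order = sorted(range(len(objects)),
                   key=lambda i: objects[i]['depth'], reverse=True)
    visible = [None] * len(objects)
    for rank, i in enumerate(order):
        mask = objects[i]['mask']
        image[mask] = objects[i]['rgb'][mask]
        visible[i] = mask.copy()
        # Objects pasted earlier (farther away) lose the covered pixels.
        for j in order[:rank]:
            visible[j] &= ~mask
    return image, visible


def occlusion_ratio(amodal, visible):
    """Fraction of the full mask that is hidden in the composite."""
    total = amodal.sum()
    return 0.0 if total == 0 else 1.0 - visible.sum() / total
```

In this toy setting the pair (full mask, visible mask) plays the role of the pseudo-groundtruth: a model can be trained on the composite image to predict the full mask even where the object is hidden. The real pipeline additionally carries keypoints, pose, and shape, which this sketch omits.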