DiffCAD: Weakly-Supervised Probabilistic CAD Model Retrieval and Alignment from an RGB Image (2311.18610v2)
Abstract: Perceiving 3D structures from RGB images based on CAD model primitives can enable an effective, efficient 3D object-based representation of scenes. However, current approaches rely on supervision from expensive annotations of CAD models associated with real images, and encounter challenges due to the inherent ambiguities in the task: both the depth-scale ambiguity in monocular perception and the inexact matches of CAD database models to real observations. We thus propose DiffCAD, the first weakly-supervised probabilistic approach to CAD retrieval and alignment from an RGB image. We formulate this as a conditional generative task, leveraging diffusion to learn implicit probabilistic models capturing the shape, pose, and scale of CAD objects in an image. This enables multi-hypothesis generation of different plausible CAD reconstructions, requiring only a few hypotheses to characterize ambiguities in depth/scale and inexact shape matches. Our approach is trained only on synthetic data, leveraging monocular depth and mask estimates to enable robust zero-shot adaptation to various real target domains. Despite being trained solely on synthetic data, our multi-hypothesis approach can even surpass the supervised state-of-the-art on the Scan2CAD dataset by 5.9% with 8 hypotheses.
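The depth-scale ambiguity mentioned in the abstract can be illustrated with a small sketch. This is not the paper's code: it assumes an ideal pinhole camera, and the focal length, the corner coordinates, and the helper names `project` and `scale_scene` are made up for illustration. It shows that an object and a copy that is twice as large and twice as far away project to the same pixels, so a single RGB image cannot tell them apart.

```python
import math

def project(point, focal=500.0):
    """Pinhole projection of a camera-space point (x, y, z), z > 0,
    to pixel offsets from the principal point. focal is an assumed value."""
    x, y, z = point
    return (focal * x / z, focal * y / z)

def scale_scene(points, s):
    """Scale every point about the camera center by factor s.
    This enlarges the object and pushes it farther away at the same time."""
    return [(s * x, s * y, s * z) for x, y, z in points]

# Four corners of a hypothetical object, in metres, in camera coordinates.
corners = [
    (-0.5, -0.25, 2.0),
    (0.5, -0.25, 2.0),
    (0.5, 0.25, 2.5),
    (-0.5, 0.25, 2.5),
]

# Same object, twice as big and twice as far from the camera.
far_corners = scale_scene(corners, 2.0)

near_pixels = [project(p) for p in corners]
far_pixels = [project(p) for p in far_corners]

# The two hypotheses are indistinguishable in the image plane.
same_image = all(
    math.isclose(u0, u1) and math.isclose(v0, v1)
    for (u0, v0), (u1, v1) in zip(near_pixels, far_pixels)
)
print("identical projections:", same_image)
```

Because every scale factor produces the same image, a method that outputs several scale and pose hypotheses, as the abstract describes, can cover a range of plausible explanations rather than committing to one.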