DiffPoint: Single and Multi-view Point Cloud Reconstruction with ViT Based Diffusion Model (2402.11241v1)
Abstract: As 2D-to-3D reconstruction has gained significant attention in various real-world scenarios, it has become crucial to be able to generate high-quality point clouds. Despite the recent success of deep learning models in generating point clouds, producing high-fidelity results remains challenging because of the disparities between images and point clouds. While vision transformers (ViT) and diffusion models have shown promise in various vision tasks, their benefits for reconstructing point clouds from images have not yet been demonstrated. In this paper, we propose a neat and powerful architecture called DiffPoint that combines ViT and diffusion models for the task of point cloud reconstruction. At each diffusion step, we divide the noisy point cloud into irregular patches. Then, using a standard ViT backbone that treats all inputs as tokens (including time information, image embeddings, and noisy patches), we train our model to predict target points conditioned on the input images. We evaluate DiffPoint on both single-view and multi-view reconstruction tasks and achieve state-of-the-art results. Additionally, we introduce a unified and flexible feature fusion module for aggregating image features from single or multiple input images. Furthermore, our work demonstrates the feasibility of applying unified architectures across languages and images to improve 3D reconstruction tasks.
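The tokenization described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the irregular patches are formed, so the sketch assumes farthest point sampling for patch centers followed by k-nearest-neighbor grouping. The function names, the linear patch embedding `w_patch`, the single noise level `alpha_bar`, and all sizes (1024 points, 32 patches of 16 points, 4 image tokens, width 64) are hypothetical choices made only for illustration. Random vectors stand in for the time and image embeddings.

```python
import numpy as np


def farthest_point_sampling(points, num_centers, rng):
    """Return indices of num_centers points spread out over the cloud."""
    n = points.shape[0]
    chosen = np.empty(num_centers, dtype=np.int64)
    chosen[0] = rng.integers(n)
    dist = np.full(n, np.inf)  # squared distance to the nearest chosen point
    for i in range(1, num_centers):
        diff = points - points[chosen[i - 1]]
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))
        chosen[i] = np.argmax(dist)
    return chosen


def group_into_patches(points, num_patches, patch_size, rng):
    """Group a cloud into (possibly overlapping) patches around sampled centers."""
    centers = points[farthest_point_sampling(points, num_patches, rng)]
    # Squared distance from every center to every point: (num_patches, n).
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    knn = np.argsort(d2, axis=1)[:, :patch_size]
    # Express each patch in coordinates local to its center.
    patches = points[knn] - centers[:, None, :]
    return centers, patches


def build_token_sequence(time_tok, image_toks, patches, w_patch):
    """Concatenate time, image, and patch tokens into one sequence."""
    patch_toks = patches.reshape(patches.shape[0], -1) @ w_patch
    return np.concatenate([time_tok[None, :], image_toks, patch_toks], axis=0)


rng = np.random.default_rng(0)
clean = rng.normal(size=(1024, 3))  # stand-in for a target point cloud

# Standard DDPM-style forward noising at one assumed noise level.
alpha_bar = 0.5
noise = rng.normal(size=clean.shape)
noisy = np.sqrt(alpha_bar) * clean + np.sqrt(1.0 - alpha_bar) * noise

centers, patches = group_into_patches(noisy, num_patches=32, patch_size=16, rng=rng)

dim = 64
w_patch = rng.normal(size=(16 * 3, dim)) * 0.02  # hypothetical patch embedding
time_tok = rng.normal(size=dim)                  # placeholder time embedding
image_toks = rng.normal(size=(4, dim))           # placeholder image embeddings

tokens = build_token_sequence(time_tok, image_toks, patches, w_patch)
print(tokens.shape)  # (37, 64): 1 time token + 4 image tokens + 32 patch tokens
```

In this sketch a transformer backbone would consume `tokens` as an ordinary sequence, which is the sense in which every input is treated as a token. Positional information for the patches (for example, derived from `centers`) is omitted here for brevity.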