AutoDecoding Latent 3D Diffusion Models (2307.05445v1)
Abstract: We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core. The 3D autodecoder framework embeds properties learned from the target dataset in the latent space, which can then be decoded into a volumetric representation for rendering view-consistent appearance and geometry. We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations to learn a 3D diffusion from 2D images or monocular videos of rigid or articulated objects. Our approach is flexible enough to use either existing camera supervision or no camera information at all -- instead efficiently learning it during training. Our evaluations demonstrate that our generation results outperform state-of-the-art alternatives on various benchmark datasets and metrics, including multi-view image datasets of synthetic objects, real in-the-wild videos of moving people, and a large-scale, real video dataset of static objects.
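The abstract mentions robust normalization and de-normalization of the volumetric latents but does not define them. As a minimal sketch of one plausible reading, the snippet below centers a latent array on its median and scales it by its interquartile range, then inverts the mapping. The choice of median and IQR statistics, the function names `robust_normalize` and `robust_denormalize`, and the array shape are assumptions made for illustration; they are not taken from the abstract.

```python
import numpy as np


def robust_normalize(latents):
    """Center on the median and scale by the interquartile range (assumed statistics)."""
    median = np.median(latents)
    q1, q3 = np.percentile(latents, [25, 75])
    iqr = q3 - q1
    # Guard against a degenerate (constant) input, where the IQR is zero.
    scale = iqr if iqr > 0 else 1.0
    return (latents - median) / scale, (median, scale)


def robust_denormalize(normalized, stats):
    """Invert robust_normalize using the stored (median, scale) pair."""
    median, scale = stats
    return normalized * scale + median


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical batch of volumetric latents: (batch, channels, depth, height, width).
    latents = rng.normal(loc=3.0, scale=2.0, size=(2, 4, 8, 8, 8))
    normalized, stats = robust_normalize(latents)
    restored = robust_denormalize(normalized, stats)
    print(np.allclose(restored, latents))
```

Median and IQR are less sensitive to outlier latent values than mean and standard deviation, which is one reason such statistics might be preferred before training a diffusion model on the normalized latents.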