Motion2VecSets: 4D Latent Vector Set Diffusion for Non-rigid Shape Reconstruction and Tracking (2401.06614v2)
Abstract: We introduce Motion2VecSets, a 4D diffusion model for dynamic surface reconstruction from point cloud sequences. While existing state-of-the-art methods have demonstrated success in reconstructing non-rigid objects using neural field representations, conventional feed-forward networks struggle with ambiguous observations from noisy, partial, or sparse point clouds. To address these challenges, we introduce a diffusion model that explicitly learns the shape and motion distribution of non-rigid objects through an iterative denoising process of compressed latent representations. The diffusion-based priors enable more plausible and probabilistic reconstructions when handling ambiguous inputs. We parameterize 4D dynamics with latent sets instead of global latent codes. This novel 4D representation allows us to learn local shape and deformation patterns, leading to more accurate non-linear motion capture and significantly improving generalizability to unseen motions and identities. For more temporally coherent object tracking, we synchronously denoise deformation latent sets and exchange information across multiple frames. To limit computational overhead, we design an interleaved space and time attention block that alternately aggregates deformation latents along the spatial and temporal domains. Extensive comparisons against state-of-the-art methods demonstrate the superiority of Motion2VecSets in 4D reconstruction from various imperfect observations. More detailed information can be found at https://vveicao.github.io/projects/Motion2VecSets/.
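The interleaved space and time attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes the deformation latents are stored as an array of shape (T, M, C) for T frames, M latent vectors per frame, and C channels, and it uses plain scaled dot-product self-attention with no learned projections, no conditioning on the input point clouds, and no diffusion timestep. The names `self_attention`, `interleaved_block`, and `attention_pairs` are hypothetical and exist only for this illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    """Scaled dot-product self-attention over the second-to-last axis.

    x has shape (..., N, C). Queries, keys, and values are all x itself
    (assumption: learned projections are omitted to keep the sketch short).
    """
    scale = 1.0 / np.sqrt(x.shape[-1])
    scores = np.einsum("...ic,...jc->...ij", x, x) * scale
    return np.einsum("...ij,...jc->...ic", softmax(scores), x)


def interleaved_block(latents):
    """One space-then-time attention pass over latents of shape (T, M, C)."""
    # Spatial step: within each frame, the M latents attend to each other.
    h = latents + self_attention(latents)
    # Temporal step: for each latent index, the T frames attend to each other.
    h_t = np.swapaxes(h, 0, 1)          # (M, T, C)
    h_t = h_t + self_attention(h_t)
    return np.swapaxes(h_t, 0, 1)       # back to (T, M, C)


def attention_pairs(T, M):
    """Count query-key pairs for joint vs. interleaved attention."""
    joint = (T * M) ** 2                # all T*M tokens attend to all tokens
    interleaved = T * M**2 + M * T**2   # spatial pass plus temporal pass
    return joint, interleaved


rng = np.random.default_rng(0)
latents = rng.normal(size=(4, 8, 16))   # T=4 frames, M=8 latents, C=16 channels
out = interleaved_block(latents)
print(out.shape)                        # (4, 8, 16)
print(attention_pairs(4, 8))            # (1024, 384)
```

Under these assumptions, alternating the two passes lets every latent receive information from every frame while scoring T·M² + M·T² pairs instead of (T·M)², which is one way to read the abstract's remark about avoiding computational overhead.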