Virtual Pets: Animatable Animal Generation in 3D Scenes
Abstract: Toward unlocking the potential of generative models in immersive 4D experiences, we introduce Virtual Pet, a novel pipeline to model realistic and diverse motions for target animal species within a 3D environment. To circumvent the limited availability of 3D motion data aligned with environmental geometry, we leverage monocular internet videos and extract deformable NeRF representations for the foreground and static NeRF representations for the background. For this, we develop a reconstruction strategy, encompassing species-level shared template learning and per-video fine-tuning. Utilizing the reconstructed data, we then train a conditional 3D motion model to learn the trajectory and articulation of foreground animals in the context of 3D backgrounds. We showcase the efficacy of our pipeline with comprehensive qualitative and quantitative evaluations using cat videos. We also demonstrate versatility across unseen cats and indoor environments, producing temporally coherent 4D outputs for enriched virtual experiences.