
Virtual Pets: Animatable Animal Generation in 3D Scenes

Published 21 Dec 2023 in cs.CV (arXiv:2312.14154v1)

Abstract: Toward unlocking the potential of generative models in immersive 4D experiences, we introduce Virtual Pet, a novel pipeline to model realistic and diverse motions for target animal species within a 3D environment. To circumvent the limited availability of 3D motion data aligned with environmental geometry, we leverage monocular internet videos and extract deformable NeRF representations for the foreground and static NeRF representations for the background. For this, we develop a reconstruction strategy, encompassing species-level shared template learning and per-video fine-tuning. Utilizing the reconstructed data, we then train a conditional 3D motion model to learn the trajectory and articulation of foreground animals in the context of 3D backgrounds. We showcase the efficacy of our pipeline with comprehensive qualitative and quantitative evaluations using cat videos. We also demonstrate versatility across unseen cats and indoor environments, producing temporally coherent 4D outputs for enriched virtual experiences.


Summary

  • The paper introduces a pipeline that integrates deformable NeRFs and dual VAEs to reconstruct and animate animal motions in 3D environments.
  • It leverages monocular videos and static scene reconstructions to extract context-aware affordances, ensuring realistic motion generation.
  • The approach enhances immersive experiences in AR/VR, gaming, and film, while promising further refinements in motion quality and interactivity.

Understanding Virtual Pets: A Pipeline for Animatable 3D Animal Motions

Introduction to Virtual Pets

3D modeling has advanced significantly, yet producing lively, interactive 3D content remains a challenge, particularly for immersive experiences. Rich virtual experiences require virtual characters whose movements are dynamically integrated with their environments. Traditionally, crafting such vivid scenes has demanded manual effort from artists and designers, which is costly, difficult to scale, and time-consuming.

A new pipeline called "Virtual Pet" addresses these challenges by modeling realistic and diverse motions of animals within 3D environments. This pipeline circumvents the issue of limited 3D motion data by using monocular internet videos alongside static representations of environments to reconstruct and animate animal motions contextually.

The Pipeline's Components

Reconstruction Strategy

The Virtual Pet pipeline relies on a two-pronged reconstruction approach. First, it uses a deformable Neural Radiance Field (NeRF) to learn a species-specific template. This shared template captures the general shape of the target animal category across a collection of videos. Next, for each individual video, this shared template is fine-tuned to represent the nuances of each animal's motion and shape accurately.
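The paper does not include code; the following is a minimal PyTorch-style sketch of how a shared canonical template field plus per-video deformation fields could be organized. All class names, layer sizes, and the time embedding are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class CanonicalTemplateNeRF(nn.Module):
    """Species-level canonical shape/appearance field shared across all videos.
    Hypothetical layer sizes; the paper's exact architecture may differ."""
    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # density + RGB in canonical space
        )

    def forward(self, x_canonical):
        out = self.mlp(x_canonical)
        density, rgb = out[..., :1], torch.sigmoid(out[..., 1:])
        return density, rgb

class PerVideoDeformation(nn.Module):
    """Per-video warp field mapping observed points (plus a time embedding)
    back into the shared canonical space."""
    def __init__(self, time_dim=8, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + time_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # residual offset into canonical space
        )

    def forward(self, x_observed, t_embed):
        offset = self.mlp(torch.cat([x_observed, t_embed], dim=-1))
        return x_observed + offset

# Stage 1: optimize the shared template jointly over the video collection.
# Stage 2: fine-tune per video, fitting a PerVideoDeformation (and optionally
# adapting the template) to that video's specific motion and shape.
template = CanonicalTemplateNeRF()
warp = PerVideoDeformation()
x_obs = torch.rand(1024, 3)     # sampled points along camera rays
t_embed = torch.rand(1024, 8)   # per-frame time embedding
density, rgb = template(warp(x_obs, t_embed))
```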

Simultaneously, the background environment is reconstructed with a static NeRF, ensuring that the animal's motions remain compatible with the scene. From the interaction between the animal's template and the scene's geometry, the pipeline extracts affordance cues that support realistic, context-aware motion modeling.
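As a rough illustration of how such affordance cues could be derived from geometry alone, the sketch below computes per-point distances and a soft contact indicator between animal surface samples and scene samples. The function name and the contact threshold are hypothetical, not values from the paper.

```python
import torch

def scene_distance_features(animal_points, scene_points, contact_thresh=0.02):
    """Toy affordance cue: distance from animal surface samples to the
    reconstructed static scene, plus a crude contact indicator."""
    # animal_points: (N, 3) samples on the posed animal template
    # scene_points:  (M, 3) samples from the static background reconstruction
    d = torch.cdist(animal_points, scene_points)   # (N, M) pairwise distances
    nearest = d.min(dim=1).values                  # (N,) distance to closest scene geometry
    contact = (nearest < contact_thresh).float()   # e.g., paw/ground or furniture contact
    return nearest, contact
```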

Motion Generation Framework

After modeling the shape and the environment, the framework employs a conditional 3D motion model consisting of two Variational Autoencoders (VAEs). The "Trajectory VAE" learns to generate the path an animal would take in the environment, while the "Articulation VAE" captures the body's articulation throughout this path.
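A minimal sketch of how such a dual-VAE design might be wired up, assuming a generic conditional VAE reused for both branches. Dimensions, conditioning layout, and names are illustrative, not the paper's exact model.

```python
import torch
import torch.nn as nn

class ConditionalMotionVAE(nn.Module):
    """Generic conditional VAE used here for both the trajectory and the
    articulation branches; sizes and conditioning are illustrative."""
    def __init__(self, motion_dim, cond_dim, latent_dim=32, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(motion_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),   # mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, motion_dim),
        )
        self.latent_dim = latent_dim

    def forward(self, motion, cond):
        mu, logvar = self.encoder(torch.cat([motion, cond], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        recon = self.decoder(torch.cat([z, cond], -1))
        return recon, mu, logvar

    def sample(self, cond):
        z = torch.randn(cond.shape[0], self.latent_dim, device=cond.device)
        return self.decoder(torch.cat([z, cond], -1))

# Trajectory branch: root translation/orientation over T frames, flattened.
# Articulation branch: per-frame body articulation, conditioned on the sampled
# trajectory in addition to the scene features.
T, traj_dim, artic_dim, scene_dim = 30, 6, 64, 128
trajectory_vae   = ConditionalMotionVAE(motion_dim=T * traj_dim,  cond_dim=scene_dim)
articulation_vae = ConditionalMotionVAE(motion_dim=T * artic_dim, cond_dim=scene_dim + T * traj_dim)
```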

These VAEs are trained on the reconstructed data and conditioned on environmental cues, such as the animal's distance to the surrounding geometry and the shape of the scene. The result is a generative model that produces environment-aware motion sequences respecting the natural affordances and constraints of the scene.
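For concreteness, a standard conditional-VAE training objective of the kind such branches are typically optimized with is sketched below; the KL weight is an illustrative value, not a reported hyperparameter. At inference, one would first sample a trajectory from its VAE given the scene features, then feed it as additional conditioning when sampling the articulation.

```python
import torch
import torch.nn.functional as F

def cvae_loss(recon, target, mu, logvar, kl_weight=1e-3):
    """Conditional-VAE objective: reconstruction term plus a weighted KL
    divergence pulling the posterior toward the unit Gaussian prior."""
    recon_loss = F.mse_loss(recon, target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl_weight * kl
```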

Rendering and Texturing

After the motion sequences are generated, the foreground (animal) and background meshes are initially textureless. To bring them to life, pre-trained text-to-image diffusion models are employed to texture both the foreground and background meshes from natural-language descriptions.
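A hedged example of this kind of text-driven texturing, using a depth-conditioned Stable Diffusion pipeline from the diffusers library to repaint a single rendered view. The model ID, file names, prompt, and strength are placeholders; the paper's actual texturing procedure may differ, and texture-synthesis methods in this vein typically also back-project such generated views onto the mesh's UV map from multiple cameras, a loop omitted here.

```python
import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from PIL import Image

# Load a depth-conditioned text-to-image diffusion model (placeholder choice).
pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

untextured_view = Image.open("render_gray.png")   # hypothetical render of the untextured mesh
textured_view = pipe(
    prompt="a fluffy orange tabby cat in a cozy living room",
    image=untextured_view,
    strength=0.9,   # how strongly the diffusion model may repaint the input view
).images[0]
textured_view.save("textured_view.png")
```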

Finally, the textured meshes, driven by the generated motion sequences, are rendered into videos with temporally coherent 4D outputs, further enriching the virtual experience.
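A minimal per-frame rendering loop, using PyTorch3D as an example rendering library, illustrates this last step. The mesh file, camera parameters, and the zero-initialized motion sequence are placeholders standing in for the pipeline's actual outputs.

```python
import torch
from pytorch3d.io import load_objs_as_meshes
from pytorch3d.renderer import (
    FoVPerspectiveCameras, RasterizationSettings, MeshRenderer,
    MeshRasterizer, SoftPhongShader, PointLights, look_at_view_transform,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
mesh = load_objs_as_meshes(["textured_cat.obj"], device=device)  # hypothetical textured mesh

# Fixed camera and a simple Phong shader; placeholder viewpoint and lighting.
R, T = look_at_view_transform(dist=2.5, elev=15.0, azim=30.0)
cameras = FoVPerspectiveCameras(device=device, R=R, T=T)
renderer = MeshRenderer(
    rasterizer=MeshRasterizer(
        cameras=cameras,
        raster_settings=RasterizationSettings(image_size=512),
    ),
    shader=SoftPhongShader(device=device, cameras=cameras,
                           lights=PointLights(device=device, location=[[0.0, 2.0, 2.0]])),
)

# Placeholder motion: per-frame vertex offsets that a motion model would supply.
num_frames = 30
motion_sequence = torch.zeros(num_frames, mesh.verts_packed().shape[0], 3, device=device)

frames = []
for vert_offsets in motion_sequence:
    posed = mesh.offset_verts(vert_offsets)          # apply the frame's deformation
    frames.append(renderer(posed)[0, ..., :3].cpu()) # RGB image for this frame
```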

Conclusion and Looking Forward

This approach represents a significant stride toward animated 3D virtual characters that are dynamically integrated with their environments. The implications are broad, spanning enhanced movie production, more immersive AR/VR experiences, and increasingly interactive gaming.

The current model focuses on a single animal species and leaves room for improvements in motion quality and adherence to physical constraints. Future work aims to capture finer details in motion and appearance, possibly guided by natural-language inputs, further narrowing the gap between virtual and real-world interactivity.
