
Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos

Published 21 Dec 2023 in cs.CV (arXiv:2312.13604v3)

Abstract: We introduce a new method for learning a generative model of articulated 3D animal motions from raw, unlabeled online videos. Unlike existing approaches for 3D motion synthesis, our model requires no pose annotations or parametric shape models for training; it learns purely from a collection of unlabeled web video clips, leveraging semantic correspondences distilled from self-supervised image features. At the core of our method is a video Photo-Geometric Auto-Encoding framework that decomposes each training video clip into a set of explicit geometric and photometric representations, including a rest-pose 3D shape, an articulated pose sequence, and texture, with the objective of re-rendering the input video via a differentiable renderer. This decomposition allows us to learn a generative model over the underlying articulated pose sequences akin to a Variational Auto-Encoding (VAE) formulation, but without requiring any external pose annotations. At inference time, we can generate new motion sequences by sampling from the learned motion VAE, and create plausible 4D animations of an animal automatically within seconds given a single input image.
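A schematic form of the training objective implied by this description, assuming a per-frame L1 rendering loss and a standard KL regularizer on the motion latent (the paper's exact loss terms may differ), is:

```latex
\mathcal{L} \;=\; \sum_{t=1}^{T} \big\lVert \mathcal{R}(S, \xi_t, \mathcal{T}, \pi_t) - I_t \big\rVert_1
\;+\; \lambda \, \mathrm{KL}\big( q(z \mid I_{1:T}) \,\Vert\, \mathcal{N}(0, I) \big)
```

where $S$ is the rest-pose 3D shape, $\xi_t$ the articulated pose at frame $t$ (decoded from the motion latent $z$), $\mathcal{T}$ the texture, $\pi_t$ the camera, and $\mathcal{R}$ the differentiable renderer.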

References (77)
  1. Text2action: Generative adversarial synthesis from language to action. In ICRA, 2018.
  2. Nonrigid structure from motion in trajectory space. In NeurIPS, 2008.
  3. 2d human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
  4. Norman Badler. Temporal Scene Analysis: Conceptual Descriptions of Object Movements. PhD thesis, University of Toronto, 1975.
  5. Simulating Humans: Computer Graphics, Animation, and Control. Oxford University Press, 1993.
  6. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In ECCV, 2016.
  7. Recovering non-rigid 3d shape from image streams. In CVPR, 2000.
  8. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
  9. What shape are dolphins? building 3d morphable models from 2d images. IEEE TPAMI, 2012.
  10. pi-GAN: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In CVPR, 2021.
  11. A simple prior-free method for non-rigid structure-from-motion factorization. In CVPR, 2012.
  12. Performance capture from sparse multi-view video. ACM TOG, 2008.
  13. Paul Debevec. The light stages and their applications to photoreal digital actors. In SIGGRAPH Asia, 2012.
  14. Topologically-aware deformation fields for single-view 3d reconstruction. CVPR, 2022.
  15. The pascal visual object classes challenge: A retrospective. IJCV, 111:98–136, 2015.
  16. Mps-nerf: Generalizable 3d human rendering from multiview images. IEEE TPAMI, 2022.
  17. Shape and viewpoints without keypoints. In ECCV, 2020.
  18. Action2motion: Conditioned generation of 3d human motions. In ACM MM, 2020.
  19. A recurrent variational autoencoder for human motion synthesis. In BMVC, 2017.
  20. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.
  21. Challencap: Monocular 3d capture of challenging human performances using multi-modal references. In CVPR, 2021.
  22. Moglow: Probabilistic and controllable motion synthesis using normalising flows. ACM TOG, 39(6):1–14, 2020.
  23. A hierarchical 3d-motion learning framework for animal spontaneous behavior mapping. Nature Communications, 12(1):2784, 2021.
  24. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE TPAMI, 36(7):1325–1339, 2014.
  25. Farm3D: Learning articulated 3D animals by distilling 2D diffusion. In 3DV, 2024.
  26. End-to-end recovery of human shape and pose. In CVPR, 2018a.
  27. Learning category-specific mesh reconstruction from image collections. In ECCV, 2018b.
  28. Learning 3d human dynamics from video. In CVPR, 2019.
  29. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  30. PointRend: Image segmentation as rendering. In CVPR, 2020.
  31. To the point: Correspondence-driven monocular 3d category reconstruction. In NeurIPS, 2021.
  32. Canonical surface mapping via geometric cycle consistency. In ICCV, 2019.
  33. Articulation-aware canonical surface mapping. In CVPR, 2020a.
  34. Articulation-aware canonical surface mapping. In CVPR, 2020b.
  35. Online adaptation for consistent mesh reconstruction in the wild. In NeurIPS, 2020a.
  36. Self-supervised single-view 3d reconstruction via semantic consistency. In ECCV, 2020b.
  37. Self-supervised single-view 3d reconstruction via semantic consistency. In ECCV, 2020c.
  38. Learning the depths of moving people by watching frozen people. In CVPR, 2019.
  39. Human motion modeling using dvgans. arXiv preprint arXiv:1804.10652, 2018.
  40. SMPL: A skinned multi-person linear model. ACM TOG, 2015.
  41. Unsupervised learning of object structure and dynamics from videos. NeurIPS, 32, 2019.
  42. Eadweard Muybridge. The horse in motion, 1878.
  43. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In CVPR, 2015.
  44. HoloGAN: Unsupervised learning of 3d representations from natural images. In ICCV, 2019.
  45. GIRAFFE: Representing scenes as compositional generative neural feature fields. In CVPR, 2021.
  46. Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In CVPR, 2020.
  47. Representing cyclic human motion using functional analysis. Image and Vision Computing, 23:1264–1276, 2005.
  48. Action-conditioned 3D human motion synthesis with transformer VAE. In ICCV, 2021.
  49. Temos: Generating diverse human motions from textual descriptions. In ECCV, 2022.
  50. Inverting generative adversarial renderer for face reconstruction. In CVPR, 2021.
  51. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In ICCV, 2019.
  52. Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization. In CVPR, 2020.
  53. GRAF: Generative radiance fields for 3d-aware image synthesis. In NeurIPS, 2020.
  54. Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis. In NeurIPS, 2021.
  55. Make-a-video: Text-to-video generation without text-video data. In ICLR, 2023.
  56. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In NeurIPS, 2019.
  57. Deepphase: Periodic autoencoders for learning motion phase manifolds. ACM TOG, 41(4), 2022.
  58. Self-supervised keypoint discovery in behavioral videos. arXiv preprint arXiv:2112.05121, 2021.
  59. Bkind-3d: Self-supervised 3d keypoint discovery from multi-view videos. arXiv preprint arXiv:2212.07401, 2022a.
  60. Controllable 3d face synthesis with conditional generative occupancy fields. In NeurIPS, 2022b.
  61. Cgof++: Controllable 3d face synthesis with conditional generative occupancy fields. IEEE TPAMI, 2023.
  62. Raft: Recurrent all-pairs field transforms for optical flow. In ECCV, 2020.
  63. Unsupervised learning of object frames by dense equivariant image labelling. NeurIPS, 30, 2017.
  64. Modeling human locomotion with topologically constrained latent variable models. In Human Motion – Understanding, Modeling, Capture and Animation, 2007.
  65. Attention is all you need. In NeurIPS, 2017.
  66. Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. NeurIPS, 30, 2017.
  67. Unsupervised learning of probably symmetric deformable 3D objects from images in the wild. In CVPR, 2020.
  68. DOVE: Learning deformable 3d objects by watching videos. arXiv preprint arXiv:2107.10844, 2021a.
  69. De-rendering the world’s revolutionary artefacts. In CVPR, 2021b.
  70. MagicPony: Learning articulated 3d animals in the wild. In CVPR, 2023.
  71. A closed-form solution to non-rigid shape and motion recovery. In ECCV, 2004.
  72. LASR: Learning articulated shape reconstruction from a monocular video. In CVPR, 2021a.
  73. ViSER: Video-specific surface embeddings for articulated 3d shape reconstruction. In NeurIPS, 2021b.
  74. BANMo: Building animatable 3d neural models from many casual videos. In CVPR, 2022a.
  75. APT-36K: A large-scale benchmark for animal pose estimation and tracking. In NeurIPS Dataset and Benchmark Track, 2022b.
  76. Lassie: Learning articulated shape from sparse image ensemble via 3d part discovery. In NeurIPS, 2022.
  77. Predicting 3d human dynamics from video. In ICCV, 2019.

Summary

  • The paper introduces a novel model that learns articulated 3D animal motions from raw online videos without relying on pose annotations.
  • It extends the MagicPony framework with a spatio-temporal transformer VAE, yielding both a motion generator and more accurate 3D reconstructions.
  • The method leverages unlabeled video data to produce plausible animations from a single image, demonstrating competitive results on standard datasets.

Overview

The paper introduces Ponymation, a generative model that learns articulated 3D animal motions from unlabeled online videos. Unlike previous motion synthesis methods, which require pose annotations or parametric shape models for training, Ponymation learns directly from raw video collections and produces diverse 3D animations. At its core, it extends MagicPony, an existing framework that learns articulated 3D shapes from single images, with a video training pipeline whose temporal regularization improves reconstruction accuracy.

Training Data and Model Capabilities

The method leverages readily available online video data and learns to generate articulated 3D motions alongside a category-specific 3D reconstruction model, without any dependence on pose annotations or shape templates. Given a single test image, the algorithm reconstructs the articulated 3D mesh of the animal and generates plausible animations by sampling from a motion latent space learned during training.
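A minimal sketch of this inference loop is shown below, assuming a MagicPony-style single-image reconstructor, a motion VAE like the one sketched under Methodology, and a differentiable renderer. All names here (`reconstructor`, `articulate`, `render`) are hypothetical placeholders, not the released API.

```python
import torch

@torch.no_grad()
def animate_from_image(image, reconstructor, motion_vae, render, num_frames=32):
    # 1. Reconstruct a canonical (rest-pose) articulated mesh, texture,
    #    and viewpoint from the single input image.
    rest_mesh, texture, camera = reconstructor(image)

    # 2. Sample a motion latent from the prior and decode a pose sequence.
    z = torch.randn(1, motion_vae.latent_dim)
    pose_seq = motion_vae.decode(z, num_frames)  # (1, T, num_bones * 3)

    # 3. Articulate the rest-pose mesh with each predicted pose and re-render.
    frames = [render(rest_mesh.articulate(pose_seq[0, t]), texture, camera)
              for t in range(num_frames)]
    return frames
```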

Methodology

The process begins by collecting internet video clips of various animal categories. These clips are used to train a spatio-temporal transformer Variational Auto-Encoder (VAE), unlike prior frameworks that operate on individual static images. The transformer VAE encodes a sequence of images into a latent motion code and decodes it into a sequence of articulated 3D poses. Training requires no explicit pose annotations; the model is supervised only by 2D reconstruction losses on the video frames.
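The following is a minimal PyTorch sketch of such a sequence-level transformer VAE over articulated poses. It assumes per-frame image features of dimension `feat_dim` and a skeleton of `num_bones` bones parameterized by axis-angle rotations; the layer sizes and pooling scheme are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MotionVAE(nn.Module):
    def __init__(self, feat_dim=384, latent_dim=128, num_bones=20,
                 num_layers=4, num_heads=8, max_frames=64):
        super().__init__()
        self.latent_dim = latent_dim
        self.embed = nn.Linear(feat_dim, latent_dim)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
        # Two learned tokens pool the whole clip into (mu, logvar).
        self.dist_tokens = nn.Parameter(torch.zeros(2, latent_dim))
        dec_layer = nn.TransformerDecoderLayer(
            d_model=latent_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=num_layers)
        # Learned per-frame queries that cross-attend to the motion latent.
        self.time_queries = nn.Parameter(torch.zeros(1, max_frames, latent_dim))
        self.to_pose = nn.Linear(latent_dim, num_bones * 3)  # axis-angle per bone

    def encode(self, frame_feats):  # frame_feats: (B, T, feat_dim)
        x = self.embed(frame_feats)
        tokens = self.dist_tokens.unsqueeze(0).expand(x.size(0), -1, -1)
        h = self.encoder(torch.cat([tokens, x], dim=1))
        return h[:, 0], h[:, 1]  # mu, logvar

    def decode(self, z, num_frames):  # z: (B, latent_dim)
        queries = self.time_queries[:, :num_frames].expand(z.size(0), -1, -1)
        h = self.decoder(queries, z.unsqueeze(1))  # cross-attend to the latent
        return self.to_pose(h)  # (B, T, num_bones * 3)

    def forward(self, frame_feats):
        mu, logvar = self.encode(frame_feats)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.decode(z, frame_feats.size(1)), mu, logvar
```

The decoded pose sequence is then used to articulate the predicted mesh, and the re-rendered frames are compared against the input video, so the only supervision is the 2D reconstruction error.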

Contributions and Results

The approach makes several key contributions. It presents a method that learns complex motion patterns without manual supervision; it employs a spatio-temporal transformer VAE architecture that effectively extracts motion information from videos; and, at inference, it generates animations of new animal instances from a single image. Compared to baselines trained on static images, the video training framework achieves improved reconstruction accuracy, and quantitative evaluations on datasets such as PASCAL VOC show the method is competitive with approaches that rely on more explicit annotations.

The model still has limitations, most notably its reliance on a predefined bone topology, which may restrict its applicability to a broader range of animal species. Future work could aim to discover the articulation structure automatically while training on videos.
