Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos
Abstract: We introduce a new method for learning a generative model of articulated 3D animal motions from raw, unlabeled online videos. Unlike existing approaches for 3D motion synthesis, our model requires no pose annotations or parametric shape models for training; it learns purely from a collection of unlabeled web video clips, leveraging semantic correspondences distilled from self-supervised image features. At the core of our method is a video Photo-Geometric Auto-Encoding framework that decomposes each training video clip into a set of explicit geometric and photometric representations, including a rest-pose 3D shape, an articulated pose sequence, and texture, with the objective of re-rendering the input video via a differentiable renderer. This decomposition allows us to learn a generative model over the underlying articulated pose sequences akin to a Variational Auto-Encoding (VAE) formulation, but without requiring any external pose annotations. At inference time, we can generate new motion sequences by sampling from the learned motion VAE, and create plausible 4D animations of an animal automatically within seconds given a single input image.
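The VAE-style formulation mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the diagonal-Gaussian posterior, the `beta` weight, and the scalar `recon_error` (standing in for the video re-rendering loss) are all assumptions made for illustration.

```python
import numpy as np


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over the latent dimension."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


def vae_loss(recon_error, mu, log_var, beta=1.0):
    """Reconstruction term plus weighted KL term.

    recon_error is a hypothetical placeholder: in the setting the abstract
    describes, it would come from re-rendering the input video.
    """
    return recon_error + beta * kl_to_standard_normal(mu, log_var)


def sample_motion_latents(num_samples, latent_dim, rng):
    """At inference time, new motion codes are drawn from the prior N(0, I)."""
    return rng.standard_normal((num_samples, latent_dim))
```

In this sketch, a decoder (not shown) would map each sampled latent code to an articulated pose sequence. The KL term is what keeps the encoder's posterior close to the prior, so that samples drawn at inference time land in a region the decoder has seen during training.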