Learning the 3D Fauna of the Web (2401.02400v2)
Abstract: Learning 3D models of all animals on Earth requires massively scaling up existing solutions. With this ultimate goal in mind, we develop 3D-Fauna, an approach that learns a pan-category deformable 3D animal model for more than 100 animal species jointly. One crucial bottleneck in modeling animals is the limited availability of training data, which we overcome by simply learning from 2D Internet images. We show that prior category-specific attempts fail to generalize to rare species with limited training images. We address this challenge by introducing the Semantic Bank of Skinned Models (SBSM), which automatically discovers a small set of base animal shapes by combining geometric inductive priors with semantic knowledge implicitly captured by an off-the-shelf self-supervised feature extractor. To train such a model, we also contribute a new large-scale dataset of diverse animal species. At inference time, given a single image of any quadruped animal, our model reconstructs an articulated 3D mesh in a feed-forward fashion within seconds.