
Learning the 3D Fauna of the Web (2401.02400v2)

Published 4 Jan 2024 in cs.CV

Abstract: Learning 3D models of all animals on the Earth requires massively scaling up existing solutions. With this ultimate goal in mind, we develop 3D-Fauna, an approach that learns a pan-category deformable 3D animal model for more than 100 animal species jointly. One crucial bottleneck of modeling animals is the limited availability of training data, which we overcome by simply learning from 2D Internet images. We show that prior category-specific attempts fail to generalize to rare species with limited training images. We address this challenge by introducing the Semantic Bank of Skinned Models (SBSM), which automatically discovers a small set of base animal shapes by combining geometric inductive priors with semantic knowledge implicitly captured by an off-the-shelf self-supervised feature extractor. To train such a model, we also contribute a new large-scale dataset of diverse animal species. At inference time, given a single image of any quadruped animal, our model reconstructs an articulated 3D mesh in a feed-forward fashion within seconds.


Summary

  • The paper presents 3D-Fauna, a framework that leverages a Semantic Bank of Skinned Models to jointly model diverse quadruped species from 2D images.
  • The paper employs self-supervised feature extraction and Non-Rigid Structure-from-Motion to reconstruct articulated 3D meshes from single-view internet photos.
  • The paper demonstrates superior performance on a curated Fauna dataset, outperforming existing methods in both qualitative and quantitative evaluations across over 100 species.

Introduction

In the field of computer vision, the ability to reconstruct humans in 3D from images has advanced significantly, facilitating applications like virtual reality, gaming, and animation. This capability, however, has largely been confined to human subjects due to the specific complexities and data requirements involved. A new framework named 3D-Fauna proposes to change this by developing a comprehensive 3D animal model that can handle a broad range of quadruped species based solely on 2D images sourced from the internet.

Semantic Bank of Skinned Models

The development of 3D-Fauna centers on the Semantic Bank of Skinned Models (SBSM), a novel technique that constructs a joint shape model for numerous animal species simultaneously. This sharing is vital for capturing rarer animals that have few images available for training. The method combines a bank of learned base shapes with semantic knowledge implicitly captured by an off-the-shelf self-supervised feature extractor, learning the commonalities and differences among animal shapes to build a pan-category animal model. The resulting model can deform and articulate to match any given image of a four-legged animal.
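To make the bank mechanism concrete, here is a minimal sketch of how a semantic shape bank might be queried: a pooled self-supervised image feature attends over learned per-base keys, and the resulting soft weights blend the base-shape codes into one code for the downstream shape decoder. All names, dimensions, and the linear attention form are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def blend_base_shapes(image_feat, bank_keys, bank_shapes):
    """Blend K base-shape codes with attention weights from an image feature.

    image_feat:  (D,)   pooled self-supervised feature (e.g. from a frozen
                        ViT encoder such as DINO)
    bank_keys:   (K, D) learnable keys, one per base shape in the bank
    bank_shapes: (K, S) base-shape codes consumed by the shape decoder
    Returns (weights, code): the soft assignment over the bank and the
    (S,) blended base-shape code.
    """
    logits = bank_keys @ image_feat / np.sqrt(image_feat.shape[0])
    weights = softmax(logits)          # soft assignment over the K bases
    return weights, weights @ bank_shapes

rng = np.random.default_rng(0)
feat = rng.normal(size=384)            # pooled image feature
keys = rng.normal(size=(20, 384))      # 20 base shapes in the bank
shapes = rng.normal(size=(20, 128))
weights, code = blend_base_shapes(feat, keys, shapes)
print(code.shape)  # (128,)
```

Because the weights are soft, rare species with few images can still be reconstructed by interpolating between base shapes learned mostly from better-represented species.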

Unsupervised Learning Approach

3D-Fauna's training overcomes the absence of multi-view or 3D supervision for most animals. It draws on principles from Non-Rigid Structure-from-Motion and self-supervised image features to reconstruct animals from single-view internet images. A mask discriminator further refines the technique by encouraging realistic silhouettes when shapes are rendered from novel viewpoints, counteracting the bias introduced by typical front-facing internet photos. Training proceeds in three stages, focusing sequentially on registering the base shapes, then articulation, and finally capturing per-instance detail.
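The mask-discriminator idea can be sketched with a standard non-saturating GAN loss on silhouettes: the discriminator separates dataset masks from masks rendered at random viewpoints, while the generator loss pushes rendered masks toward the "real" decision, discouraging shapes that only look plausible from the frequent frontal view. The toy linear discriminator below is an illustrative stand-in for a convolutional one; shapes and parameters are assumptions, not the paper's design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mask_discriminator_losses(real_masks, fake_masks, w, b):
    """Non-saturating GAN losses for a toy linear mask discriminator.

    real_masks: (N, P) flattened binary silhouettes from the dataset
    fake_masks: (N, P) silhouettes rendered from random novel viewpoints
    w, b:       linear discriminator parameters (illustrative)
    Returns (d_loss, g_loss).
    """
    p_real = sigmoid(real_masks @ w + b)   # prob. mask is "real"
    p_fake = sigmoid(fake_masks @ w + b)
    eps = 1e-8
    # Discriminator: classify real masks as 1, rendered masks as 0.
    d_loss = -np.mean(np.log(p_real + eps)) - np.mean(np.log(1.0 - p_fake + eps))
    # Generator: make rendered masks indistinguishable from real ones.
    g_loss = -np.mean(np.log(p_fake + eps))
    return d_loss, g_loss

rng = np.random.default_rng(1)
real = (rng.random((4, 64)) > 0.5).astype(float)
fake = (rng.random((4, 64)) > 0.5).astype(float)
w = rng.normal(size=64) * 0.01
d_loss, g_loss = mask_discriminator_losses(real, fake, w, 0.0)
print(d_loss > 0 and g_loss > 0)  # True
```

In practice the generator side of this loss would backpropagate through a differentiable renderer into the shape model, so viewpoint-biased shapes are penalized directly.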

Fauna Dataset and Performance

The system was trained on the specially curated Fauna Dataset, consisting of images from over 100 quadruped species. After training, 3D-Fauna can transform single images into detailed, articulated 3D meshes. Comparative analyses showed that this approach outperformed existing methods in both qualitative and quantitative evaluations, producing 3D models for animals ranging from commonly photographed species to those barely represented in the available data.

Conclusion

3D-Fauna marks a notable advance in computer vision and animal modeling. It can infer a detailed, articulated 3D structure for a wide range of quadruped animals from a single image, opening broader applications wherever the full diversity of animal shapes and movements must be understood and replicated. While currently limited to animals sharing a common skeletal plan, namely quadrupeds, and dependent on some degree of image curation, 3D-Fauna sets a new bar for future efforts to model the natural world in three dimensions.
