
Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses (2404.14410v1)

Published 22 Apr 2024 in cs.CV

Abstract: In this paper, we present a method to reconstruct the world and multiple dynamic humans in 3D from a monocular video input. As a key idea, we represent both the world and multiple humans via the recently emerging 3D Gaussian Splatting (3D-GS) representation, enabling us to conveniently and efficiently compose and render them together. In particular, we address the scenarios with severely limited and sparse observations in 3D human reconstruction, a common challenge encountered in the real world. To tackle this challenge, we introduce a novel approach to optimize the 3D-GS representation in a canonical space by fusing the sparse cues in the common space, where we leverage a pre-trained 2D diffusion model to synthesize unseen views while keeping the consistency with the observed 2D appearances. We demonstrate that our method can reconstruct high-quality animatable 3D humans in various challenging examples, in the presence of occlusion, image crops, few-shot, and extremely sparse observations. After reconstruction, our method is capable of not only rendering the scene from novel views at arbitrary time instances, but also editing the 3D scene by removing individual humans or applying different motions to each human. Through various experiments, we demonstrate the quality and efficiency of our method over alternative existing approaches.

References (67)
  1. Bilinear spatiotemporal basis models. TOG, 2012.
  2. Video based reconstruction of 3d people models. In CVPR, 2018.
  3. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In ECCV, 2016.
  4. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. TPAMI, 2019.
  5. Generative novel view synthesis with 3d-aware diffusion models. In ICCV, 2023.
  6. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In ICCV, 2023.
  7. Fast-snarf: A fast deformer for articulated neural fields. TPAMI, 2023.
  8. Snarf: Differentiable forward skinning for animating non-rigid neural implicit shapes. In ICCV, 2021.
  9. Nerdi: Single-view nerf synthesis with language-guided diffusion as general image priors. In CVPR, 2023.
  10. An image is worth one word: Personalizing text-to-image generation using textual inversion. In ICLR, 2023.
  11. Humans in 4D: Reconstructing and tracking humans with transformers. In ICCV, 2023.
  12. Vid2avatar: 3d avatar reconstruction from videos in the wild via self-supervised scene decomposition. In CVPR, 2023.
  13. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.
  14. Denoising diffusion probabilistic models. In NeurIPS, 2020.
  15. TeCH: Text-guided Reconstruction of Lifelike Clothed Humans. In 3DV, 2024.
  16. Selfrecon: Self reconstruction your digital avatar from monocular video. In CVPR, 2022.
  17. Instantavatar: Learning avatars from monocular video in 60 seconds. In CVPR, 2023.
  18. Neuman: Neural human radiance field from a single video. In ECCV, 2022.
  19. Panoptic studio: A massively multiview system for social interaction capture. TPAMI, 2017.
  20. 3d gaussian splatting for real-time radiance field rendering. In SIGGRAPH, 2023.
  21. Adam: A method for stochastic optimization. In ICLR, 2015.
  22. Segment anything. In ICCV, 2023.
  23. Multi-concept customization of text-to-image diffusion. In CVPR, 2023.
  24. Panoptic Neural Fields: A Semantic Object-Aware Neural Scene Representation. In CVPR, 2022.
  25. Rgbd2: Generative scene synthesis via incremental view inpainting using rgbd diffusion models. In CVPR, 2023.
  26. Tada! text to animatable digital avatars. In 3DV, 2024.
  27. Magic3d: High-resolution text-to-3d content creation. In CVPR, 2023.
  28. Zero-1-to-3: Zero-shot one image to 3d object. In ICCV, 2023.
  29. Smpl: A skinned multi-person linear model. TOG, 2015.
  30. Repaint: Inpainting using denoising diffusion probabilistic models. In CVPR, 2022.
  31. AMASS: Archive of motion capture as surface shapes. In ICCV, 2019.
  32. RealFusion: 360° reconstruction of any object from a single image. In CVPR, 2023.
  33. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  34. Instant neural graphics primitives with a multiresolution hash encoding. In SIGGRAPH, 2022.
  35. Continuous surface embeddings. In NeurIPS, 2020.
  36. Neural scene graphs for dynamic scenes. In CVPR, 2021.
  37. 3d reconstruction of a moving point from a series of 2d projections. In ECCV, 2010.
  38. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In CVPR, 2021.
  39. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR, 2023.
  40. Tracking people by predicting 3D appearance, location & pose. In CVPR, 2022.
  41. Lolnerf: Learn from one look. In CVPR, 2022.
  42. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  43. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023.
  44. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
  45. Structure-from-motion revisited. In CVPR, 2016.
  46. Diffustereo: High quality human reconstruction via diffusion-based stereo using sparse cameras. In ECCV, 2022.
  47. Novel view synthesis of human interactions from sparse multi-view videos. In SIGGRAPH, 2022.
  48. Monocular, one-stage, regression of multiple 3d people. In ICCV, 2021.
  49. DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras. In NeurIPS, 2021.
  50. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In CVPR, 2023.
  51. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In NeurIPS, 2021.
  52. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. In NeurIPS, 2023.
  53. Humannerf: Free-viewpoint rendering of moving people from monocular video. In CVPR, 2022.
  54. D^2NeRF: Self-supervised decoupling of dynamic and static objects from a monocular video. In NeurIPS, 2022.
  55. Rendering humans from object-occluded monocular videos. In ICCV, 2023.
  56. Ghum & ghuml: Generative 3d human shape and articulated pose models. In CVPR, 2020.
  57. Monoperfcap: Human performance capture from monocular video. In SIGGRAPH, 2018.
  58. ViTPose: Simple vision transformer baselines for human pose estimation. In NeurIPS, 2022.
  59. Banmo: Building animatable 3d neural models from many casual videos. In CVPR, 2022.
  60. Hi4d: 4d instance segmentation of close human interaction. In CVPR, 2023.
  61. MonoHuman: Animatable human neural field from monocular video. In CVPR, 2023.
  62. Editable free-viewpoint video using a layered neural representation. In SIGGRAPH, 2021.
  63. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
  64. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
  65. Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction. In CVPR, 2023.
  66. Hifa: High-fidelity text-to-3d with advanced diffusion guidance. In ICLR, 2024.
  67. EWA volume splatting. In IEEE Visualization (VIS '01), 2001.

Summary

  • The paper presents a framework integrating 3D Gaussian Splatting with 2D diffusion models to reconstruct high-quality, animatable 3D humans from sparse data.
  • It demonstrates robust dynamic scene reconstruction under occlusion, outperforming methods like HumanNeRF with up to 40x faster rendering.
  • The unified representation enables novel view synthesis and scene editing, offering practical benefits for VR, gaming, and cinematic production.

Dynamic 3D Scene Reconstruction from Monocular Video: Analyzing "Guess The Unseen"

In "Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses," the authors present a sophisticated approach to reconstructing dynamic 3D scenes using sparse data from monocular video input. This paper draws attention to the challenges in digitizing dynamic human subjects in 3D under scenarios with partial visibility and occlusion—a common issue in real-world applications. The methodology integrates 3D Gaussian Splatting (3D-GS) with a novel optimization approach that leverages pre-trained 2D diffusion models to reconstruct unseen components with higher fidelity than previous models.

Methodological Overview

The authors propose a framework in which both the static environment and the dynamic human figures are represented with 3D Gaussian Splatting. This unified representation serves several purposes: it lets multiple components be composed efficiently for rendering, supports rendering from arbitrary viewpoints, and enables animation by transforming each human into any pose defined by SMPL parameters.
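To make this composition concrete, the following is a minimal sketch of how human Gaussians maintained in a canonical space could be posed with SMPL-style linear blend skinning and then combined with static scene Gaussians before splatting. All names are illustrative and not taken from the paper's code, and only the Gaussian centers are deformed here; a full implementation would also transform the rotations and covariances.

```python
import torch

def deform_canonical_gaussians(mu_c, skin_weights, bone_transforms):
    """Warp canonical-space Gaussian centers into the posed space via
    linear blend skinning (LBS), as commonly done with SMPL.

    mu_c:            (N, 3)   canonical Gaussian centers
    skin_weights:    (N, J)   per-Gaussian skinning weights over J joints
    bone_transforms: (J, 4, 4) rigid joint transforms for the target pose
    """
    # Blend the per-joint rigid transforms into one transform per Gaussian.
    T = torch.einsum('nj,jab->nab', skin_weights, bone_transforms)      # (N, 4, 4)
    mu_h = torch.cat([mu_c, torch.ones_like(mu_c[:, :1])], dim=-1)      # homogeneous coords
    mu_posed = torch.einsum('nab,nb->na', T, mu_h)[:, :3]
    return mu_posed

def compose_scene(scene_mu, human_mu_list):
    # The static scene and each posed human share one Gaussian list, so a
    # single splatting pass can render the full dynamic scene.
    return torch.cat([scene_mu] + human_mu_list, dim=0)
```

Because every entity keeps its own set of Gaussians, the same composition step also underlies the editing operations discussed later, such as removing or re-animating an individual person.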

A significant advancement asserted in this work is its robustness to the severely limited and sparse 3D observations typical of monocular settings. By optimizing the representation in a canonical space, the authors mitigate the inconsistent and insufficient data that commonly plague existing frameworks. The solution leverages a pre-trained 2D diffusion model to synthesize unseen views while preserving consistency with the observed 2D appearances.
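The sketch below illustrates one plausible form of this supervision, assuming a PyTorch setup with a frozen 2D diffusion noise predictor: observed views are tied to the input frames with a reconstruction loss, while unseen views receive score-distillation-style guidance from the 2D prior. The `diffusion_eps` callable, the loss weighting, and the omission of the usual timestep weighting are assumptions made for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def reconstruction_and_sds_losses(render_observed, gt_observed, render_novel,
                                  diffusion_eps, alphas_cumprod, lambda_sds=0.1):
    """Combine observed-view reconstruction with diffusion guidance on unseen views.

    render_observed: rendering at an observed camera, (B, 3, H, W)
    gt_observed:     the corresponding ground-truth video frame
    render_novel:    rendering from an unseen viewpoint
    diffusion_eps:   hypothetical frozen 2D diffusion network predicting noise
    alphas_cumprod:  (T,) cumulative noise schedule of that diffusion model
    """
    # Observed views: stay faithful to the input video.
    loss_recon = F.mse_loss(render_observed, gt_observed)

    # Unseen views: score-distillation-style guidance from the 2D prior.
    t = torch.randint(0, alphas_cumprod.shape[0], (1,), device=render_novel.device)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(render_novel)
    noisy = a_t.sqrt() * render_novel + (1.0 - a_t).sqrt() * noise
    with torch.no_grad():
        eps_pred = diffusion_eps(noisy, t)   # frozen prior, no gradient through it
    # Gradient flows only through render_novel; (eps_pred - noise) acts as the
    # score residual (timestep weighting omitted for brevity).
    loss_sds = ((eps_pred - noise).detach() * render_novel).mean()

    return loss_recon + lambda_sds * loss_sds
```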

Key Contributions and Results

Notably, the results demonstrate that the proposed method reconstructs high-quality, animatable 3D humans even under challenging occlusion and sparse observations. Furthermore, modeling the static world alongside these dynamic entities yields a comprehensive rendering system capable of novel view synthesis and scene editing, including the removal or repositioning of individual humans, as sketched below.
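As a small illustration of why the unified representation makes such edits straightforward, a hypothetical per-entity grouping of Gaussians lets a person be removed simply by excluding that group before rendering; all names and sizes below are made up for the example.

```python
import torch

# Hypothetical per-entity storage of Gaussian centers (illustrative sizes).
entities = {
    "scene":   torch.randn(100_000, 3),  # static background
    "human_0": torch.randn(20_000, 3),   # first reconstructed person
    "human_1": torch.randn(20_000, 3),   # second reconstructed person
}

# Editing by removal: drop one human's Gaussians before composition.
del entities["human_1"]

# The remaining groups are concatenated and passed to the splatting renderer.
composed_centers = torch.cat(list(entities.values()), dim=0)
```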

The reported experimental evaluations suggest a clear advantage over existing alternatives such as HumanNeRF and InstantAvatar. Where these baselines struggle with occluded scenes or incur a heavy computational cost, the proposed method renders up to 40 times faster in some instances with a markedly lower computational burden.

Practical and Theoretical Implications

In terms of practical applications, the ability to accurately reconstruct and animate 3D humans from sparse data holds promise for various fields, such as virtual reality (VR), gaming, and cinematic productions, where realistic motion representation and real-time rendering are critical. From a theoretical standpoint, the integration of diffusion models to infer unseen parts of a scene could influence future work in sparse-view reconstruction tasks, suggesting new pathways to handle incomplete data in 3D vision systems.

Future Directions

The paper sets a precedent for further research by addressing both specific limitations and broader unresolved challenges. Future work may explore the representation of dynamic elements beyond humans, such as animals or vehicles. In addition, improvements to the diffusion-guided appearance synthesis, whether through richer contextual understanding or continued refinement of shape and pose representations, may pave the way for more nuanced and finely detailed reconstructions.

Overall, this research contributes a compelling approach to overcoming inherent limitations in dynamic 3D scene reconstructions from monocular video inputs, positioning it as a valuable reference point for ongoing advancements in AI-driven 3D modeling.