MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds (2405.17421v2)
Abstract: We introduce 4D Motion Scaffolds (MoSca), a modern 4D reconstruction system designed to reconstruct and synthesize novel views of dynamic scenes from monocular videos captured casually in the wild. To address this challenging and ill-posed inverse problem, we leverage prior knowledge from vision foundation models and lift the video data to a novel Motion Scaffold (MoSca) representation, which compactly and smoothly encodes the underlying motions and deformations. The scene geometry and appearance are then disentangled from the deformation field and are encoded by globally fusing Gaussians anchored onto the MoSca, optimized via Gaussian Splatting. Additionally, camera focal length and poses can be solved using bundle adjustment, without the need for any other pose estimation tools. Experiments demonstrate state-of-the-art performance on dynamic rendering benchmarks, as well as the system's effectiveness on in-the-wild videos.
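To make the "Gaussians anchored onto the MoSca" idea concrete, below is a minimal NumPy sketch of scaffold-driven deformation: each Gaussian center is warped by blending the rigid transforms of its K nearest scaffold nodes, in the spirit of embedded deformation graphs. This is an illustrative simplification, not the paper's implementation (MoSca blends dual quaternions rather than linearly blending transforms, and its skinning weights come from the optimized scaffold topology); all names and parameters here are assumptions.

```python
import numpy as np

def warp_points(points, node_pos, node_R, node_t, K=3, sigma=0.1):
    """Warp points by blending the rigid transforms of their K nearest
    scaffold nodes (embedded-deformation-style skinning; illustrative
    simplification of MoSca's dual-quaternion blending).

    points:   (P, 3) Gaussian centers at the canonical time
    node_pos: (N, 3) scaffold node positions at the canonical time
    node_R:   (N, 3, 3) per-node rotations to the target time
    node_t:   (N, 3) per-node translations to the target time
    """
    # Squared distances from every point to every scaffold node: (P, N).
    d2 = ((points[:, None, :] - node_pos[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, :K]        # K nearest nodes per point
    # Gaussian-falloff skinning weights, normalized per point.
    w = np.exp(-np.take_along_axis(d2, idx, 1) / (2 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)

    warped = np.zeros_like(points)
    for k in range(K):
        j = idx[:, k]
        # Each node moves a point rigidly about the node's own position.
        local = points - node_pos[j]
        moved = np.einsum('pij,pj->pi', node_R[j], local) + node_pos[j] + node_t[j]
        warped += w[:, k:k + 1] * moved
    return warped
```

Because the same scaffold transforms warp every anchored Gaussian, geometry and appearance (held by the Gaussians) stay disentangled from the motion (held by the scaffold), which is what allows the Gaussians to be fused globally across frames.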