ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis (2409.02048v1)

Published 3 Sep 2024 in cs.CV

Abstract: Despite recent advancements in neural 3D reconstruction, the dependence on dense multi-view captures restricts their broader applicability. In this work, we propose ViewCrafter, a novel method for synthesizing high-fidelity novel views of generic scenes from single or sparse images with the prior of video diffusion model. Our method takes advantage of the powerful generation capabilities of video diffusion model and the coarse 3D clues offered by point-based representation to generate high-quality video frames with precise camera pose control. To further enlarge the generation range of novel views, we tailored an iterative view synthesis strategy together with a camera trajectory planning algorithm to progressively extend the 3D clues and the areas covered by the novel views. With ViewCrafter, we can facilitate various applications, such as immersive experiences with real-time rendering by efficiently optimizing a 3D-GS representation using the reconstructed 3D points and the generated novel views, and scene-level text-to-3D generation for more imaginative content creation. Extensive experiments on diverse datasets demonstrate the strong generalization capability and superior performance of our method in synthesizing high-fidelity and consistent novel views.

References (79)
  1. B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” in ECCV, 2020.
  2. B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM TOG, 2023.
  3. O. Wiles, G. Gkioxari, R. Szeliski, and J. Johnson, “Synsin: End-to-end view synthesis from a single image,” in CVPR, 2020.
  4. R. Rombach, P. Esser, and B. Ommer, “Geometry-free view synthesis: Transformers and no 3d priors,” in ICCV, 2021.
  5. C. Rockwell, D. F. Fouhey, and J. Johnson, “Pixelsynth: Generating a 3d-consistent experience from a single image,” in ICCV, 2021.
  6. B. Park, H. Go, and C. Kim, “Bridging implicit and explicit geometric transformation for single-image view synthesis,” IEEE TPAMI, 2024.
  7. T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely, “Stereo magnification: Learning view synthesis using multiplane images,” ACM TOG, 2018.
  8. Y. Han, R. Wang, and J. Yang, “Single-view view synthesis in the wild with learned adaptive multiplane images,” in SIGGRAPH Conference, 2022.
  9. R. Tucker and N. Snavely, “Single-view view synthesis with multiplane images,” in CVPR, 2020.
  10. A. Yu, V. Ye, M. Tancik, and A. Kanazawa, “pixelnerf: Neural radiance fields from one or few images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4578–4587.
  11. R. Liu, R. Wu, B. Van Hoorick, P. Tokmakov, S. Zakharov, and C. Vondrick, “Zero-1-to-3: Zero-shot one image to 3d object,” in ICCV, 2023.
  12. K. Sargent, Z. Li, T. Shah, C. Herrmann, H.-X. Yu, Y. Zhang, E. R. Chan, D. Lagun, L. Fei-Fei, D. Sun, and J. Wu, “ZeroNVS: Zero-shot 360-degree view synthesis from a single real image,” in CVPR, 2024.
  13. Z. Wang, Z. Yuan, X. Wang, T. Chen, M. Xia, P. Luo, and Y. Shan, “Motionctrl: A unified and flexible motion controller for video generation,” in SIGGRAPH Conference, 2024.
  14. J. Chung, S. Lee, H. Nam, J. Lee, and K. M. Lee, “Luciddreamer: Domain-free generation of 3d gaussian splatting scenes,” arXiv preprint arXiv:2311.13384, 2023.
  15. J. Shriram, A. Trevithick, L. Liu, and R. Ramamoorthi, “Realmdreamer: Text-driven 3d scene generation with inpainting and depth diffusion,” arXiv preprint arXiv:2404.07199, 2024.
  16. P.-Y. Lab and T. A. etc., “Open-sora-plan,” https://github.com/PKU-YuanGroup/Open-Sora-Plan, 2024. [Online]. Available: https://doi.org/10.5281/zenodo.10948109
  17. A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts et al., “Stable video diffusion: Scaling latent video diffusion models to large datasets,” arXiv preprint arXiv:2311.15127, 2023.
  18. J. Xing, M. Xia, Y. Zhang, H. Chen, X. Wang, T.-T. Wong, and Y. Shan, “Dynamicrafter: Animating open-domain images with video diffusion priors,” arXiv preprint arXiv:2310.12190, 2023.
  19. S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud, “Dust3r: Geometric 3d vision made easy,” in CVPR, 2024.
  20. V. Leroy, Y. Cabon, and J. Revaud, “Grounding image matching in 3d with mast3r,” arXiv:2406.09756, 2024.
  21. A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun, “Tanks and temples: Benchmarking large-scale scene reconstruction,” ACM TOG, 2017.
  22. J. Reizenstein, R. Shapovalov, P. Henzler, L. Sbordone, P. Labatut, and D. Novotny, “Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction,” in ICCV, 2021.
  23. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS, 2017.
  24. A. Trevithick, M. Chan, M. Stengel, E. Chan, C. Liu, Z. Yu, S. Khamis, M. Chandraker, R. Ramamoorthi, and K. Nagano, “Real-time radiance fields for single-image portrait view synthesis,” ACM TOG, 2023.
  25. W. Yu, Y. Fan, Y. Zhang, X. Wang, F. Yin, Y. Bai, Y.-P. Cao, Y. Shan, Y. Wu, Z. Sun et al., “Nofa: Nerf-based one-shot facial avatar reconstruction,” in SIGGRAPH Conference, 2023.
  26. Y. Hong, K. Zhang, J. Gu, S. Bi, Y. Zhou, D. Liu, F. Liu, K. Sunkavalli, T. Bui, and H. Tan, “Lrm: Large reconstruction model for single image to 3d,” in ICLR, 2024.
  27. D. Charatan, S. L. Li, A. Tagliasacchi, and V. Sitzmann, “pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction,” in CVPR, 2024.
  28. Y. Chen, H. Xu, C. Zheng, B. Zhuang, M. Pollefeys, A. Geiger, T.-J. Cham, and J. Cai, “Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images,” in ECCV, 2024.
  29. J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in NeurIPS, 2020.
  30. J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” in ICLR, 2021.
  31. R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in CVPR, 2022.
  32. W. Yu, L. Yuan, Y.-P. Cao, X. Gao, X. Li, L. Quan, Y. Shan, and Y. Tian, “Hifi-123: Towards high-fidelity one image to 3d content generation,” in ECCV, 2024.
  33. J. Tang, T. Wang, B. Zhang, T. Zhang, R. Yi, L. Ma, and D. Chen, “Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior,” in ICCV, 2023.
  34. J. Sun, B. Zhang, R. Shao, L. Wang, W. Liu, Z. Xie, and Y. Liu, “Dreamcraft3d: Hierarchical 3d generation with bootstrapped diffusion prior,” in ICLR, 2024.
  35. E. R. Chan, K. Nagano, M. A. Chan, A. W. Bergman, J. J. Park, A. Levy, M. Aittala, S. De Mello, T. Karras, and G. Wetzstein, “Generative novel view synthesis with 3d-aware diffusion models,” in ICCV, 2023.
  36. M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi, “Objaverse: A universe of annotated 3d objects,” in CVPR, 2023.
  37. A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., “Shapenet: An information-rich 3d model repository,” arXiv preprint arXiv:1512.03012, 2015.
  38. D. Watson, W. Chan, R. M. Brualla, J. Ho, A. Tagliasacchi, and M. Norouzi, “Novel view synthesis with diffusion models,” in ICLR, 2023.
  39. X. Yu, M. Xu, Y. Zhang, H. Liu, C. Ye, Y. Wu, Z. Yan, C. Zhu, Z. Xiong, T. Liang et al., “Mvimgnet: A large-scale dataset of multi-view images,” in CVPR, 2023.
  40. R. Wu, B. Mildenhall, P. Henzler, K. Park, R. Gao, D. Watson, P. P. Srinivasan, D. Verbin, J. T. Barron, B. Poole et al., “Reconfusion: 3d reconstruction with diffusion priors,” in CVPR, 2024.
  41. J. Zhang, X. Li, Z. Wan, C. Wang, and J. Liao, “Text2nerf: Text-driven 3d scene generation with neural radiance fields,” IEEE TVCG, 2024.
  42. L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in ICCV, 2023.
  43. C. Mou, X. Wang, L. Xie, Y. Wu, J. Zhang, Z. Qi, and Y. Shan, “T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,” in AAAI, 2024.
  44. Y. Li, H. Liu, Q. Wu, F. Mu, J. Yang, J. Gao, C. Li, and Y. J. Lee, “Gligen: Open-set grounded text-to-image generation,” in CVPR, 2023.
  45. J. Xing, H. Liu, M. Xia, Y. Zhang, X. Wang, Y. Shan, and T.-T. Wong, “Tooncrafter: Generative cartoon interpolation,” arXiv preprint arXiv:2405.17933, 2024.
  46. J. Xing, M. Xia, Y. Liu, Y. Zhang, Y. Zhang, Y. He, H. Liu, H. Chen, X. Cun, X. Wang et al., “Make-your-video: Customized video generation using textual and structural guidance,” IEEE TVCG, 2024.
  47. P. Esser, J. Chiu, P. Atighehchian, J. Granskog, and A. Germanidis, “Structure and content-guided video synthesis with diffusion models,” in ICCV, 2023.
  48. S. Yin, C. Wu, J. Liang, J. Shi, H. Li, G. Ming, and N. Duan, “Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory,” arXiv preprint arXiv:2308.08089, 2023.
  49. M. Niu, X. Cun, X. Wang, Y. Zhang, Y. Shan, and Y. Zheng, “Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model,” arXiv preprint arXiv:2405.20222, 2024.
  50. E. Peruzzo, V. Goel, D. Xu, X. Xu, Y. Jiang, Z. Wang, H. Shi, and N. Sebe, “Vase: Object-centric appearance and shape manipulation of real videos,” arXiv preprint arXiv:2401.02473, 2024.
  51. Y. Guo, C. Yang, A. Rao, Y. Wang, Y. Qiao, D. Lin, and B. Dai, “Animatediff: Animate your personalized text-to-image diffusion models without specific tuning,” arXiv preprint arXiv:2307.04725, 2023.
  52. E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” in ICLR, 2022.
  53. N. Müller, K. Schwarz, B. Rössle, L. Porzi, S. R. Bulò, M. Nießner, and P. Kontschieder, “Multidiff: Consistent novel view synthesis from a single image,” in CVPR, 2024.
  54. A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in CVPR, 2017.
  55. D. Xu, W. Nie, C. Liu, S. Liu, J. Kautz, Z. Wang, and A. Vahdat, “Camco: Camera-controllable 3d-consistent image-to-video generation,” arXiv preprint arXiv:2406.02509, 2024.
  56. H. He, Y. Xu, Y. Guo, G. Wetzstein, B. Dai, H. Li, and C. Yang, “Cameractrl: Enabling camera control for text-to-video generation,” arXiv preprint arXiv:2404.02101, 2024.
  57. V. Sitzmann, S. Rezchikov, B. Freeman, J. Tenenbaum, and F. Durand, “Light field networks: Neural scene representations with single-evaluation rendering,” in NeurIPS, 2021.
  58. G. Metzer, E. Richardson, O. Patashnik, R. Giryes, and D. Cohen-Or, “Latent-nerf for shape-guided generation of 3d shapes and textures,” in CVPR, 2023.
  59. F. Plastria, “The weiszfeld algorithm: proof, amendments and extensions, ha eiselt and v. marianov (eds.) foundations of location analysis, international series in operations research and management science,” 2011.
  60. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.
  61. R. Gao, A. Holynski, P. Henzler, A. Brussee, R. Martin-Brualla, P. Srinivasan, J. T. Barron, and B. Poole, “Cat3d: Create anything in 3d with multi-view diffusion models,” arXiv preprint arXiv:2405.10314, 2024.
  62. R. Zeng, Y. Wen, W. Zhao, and Y.-J. Liu, “View planning in robot active vision: A survey of systems, algorithms, and applications,” Computational Visual Media, vol. 6, 2020.
  63. H. Dhami, V. D. Sharma, and P. Tokekar, “Pred-nbv: Prediction-guided next-best-view planning for 3d object reconstruction,” in IROS, 2023.
  64. L. Jin, X. Chen, J. Rückin, and M. Popović, “Neu-nbv: Next best view planning using uncertainty estimation in image-based neural rendering,” in IROS, 2023.
  65. S. Isler, R. Sabzevari, J. Delmerico, and D. Scaramuzza, “An information gain formulation for active volumetric 3d reconstruction,” in ICRA, 2016.
  66. D. Peralta, J. Casimiro, A. M. Nilles, J. A. Aguilar, R. Atienza, and R. Cajote, “Next-best view policy for 3d reconstruction,” in ECCVW, 2020.
  67. M.-L. Shih, S.-Y. Su, J. Kopf, and J.-B. Huang, “3d photography using context-aware layered depth inpainting,” in CVPR, 2020.
  68. L. Ling, Y. Sheng, Z. Tu, W. Zhao, C. Xin, K. Wan, L. Yu, Q. Guo, Z. Yu, Y. Lu et al., “Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision,” in CVPR, 2024.
  69. N. Ravi, J. Reizenstein, D. Novotny, T. Gordon, W.-Y. Lo, J. Johnson, and G. Gkioxari, “Accelerating 3d deep learning with pytorch3d,” arXiv:2007.08501, 2020.
  70. J. Ho and T. Salimans, “Classifier-free diffusion guidance,” arXiv preprint arXiv:2207.12598, 2022.
  71. Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE TIP, vol. 13, no. 4, pp. 600–612, 2004.
  72. R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018.
  73. M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” in NeurIPS, 2017.
  74. J. L. Schönberger and J.-M. Frahm, “Structure-from-motion revisited,” in CVPR, 2016.
  75. J. Li, J. Zhang, X. Bai, J. Zheng, X. Ning, J. Zhou, and L. Gu, “Dngaussian: Optimizing sparse-view 3d gaussian radiance fields with global-local depth normalization,” in CVPR, 2024.
  76. Z. Zhu, Z. Fan, Y. Jiang, and Z. Wang, “Fsgs: Real-time few-shot view synthesis using gaussian splatting,” in ECCV, 2024.
  77. Z. Fan, W. Cong, K. Wen, K. Wang, J. Zhang, X. Ding, D. Xu, B. Ivanovic, M. Pavone, G. Pavlakos et al., “Instantsplat: Unbounded sparse-view pose-free gaussian splatting in 40 seconds,” arXiv:2403.20309, 2024.
  78. S. Bahmani, I. Skorokhodov, A. Siarohin, W. Menapace, G. Qian, M. Vasilkovsky, H.-Y. Lee, C. Wang, J. Zou, A. Tagliasacchi et al., “Vd3d: Taming large video diffusion transformers for 3d camera control,” arXiv preprint arXiv:2407.12781, 2024.
  79. Y.-B. Jia, “Plücker coordinates for lines in the space,” Problem Solver Techniques for Applied Computer Science, Com-S-477/577 Course Handout, 2020.

Summary

  • The paper introduces ViewCrafter, a method that combines point-conditioned video diffusion with point cloud reconstruction for high-fidelity novel view synthesis from sparse inputs.
  • It employs an iterative view synthesis strategy with adaptive camera trajectory planning to improve scene reconstruction and maintain precise pose control.
  • Experimental results on Tanks-and-Temples, RealEstate10K, and CO3D datasets demonstrate significant improvements in LPIPS, PSNR, SSIM, and FID over previous state-of-the-art methods.

The paper introduces ViewCrafter, a method for high-fidelity novel view synthesis of generic scenes from single or sparse images. It combines the generative power of video diffusion models with the coarse 3D priors offered by point-based representations to produce high-quality video frames under precise camera pose control.

The core of ViewCrafter is a point-conditioned video diffusion model that generates consistent videos under a novel view trajectory, conditioned on frames rendered from a point cloud reconstructed from single or sparse images. This addresses the limitations of existing 3D neural reconstruction techniques, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3D-GS), which require dense multi-view captures.

Key components and strategies include:

  • Point Cloud Reconstruction: A point cloud representation is established from the reference image(s) using the dense stereo model DUSt3R. When only a single input image is available, the image is duplicated to form a paired input, from which its point map and camera intrinsics are estimated. For more than two input images, global point map alignment is performed over a few optimization iterations. The colored point cloud is obtained by pairing the point maps with their corresponding RGB images (a minimal sketch of this step follows the list).
  • Point-Conditioned Video Diffusion Model: The method learns a conditional distribution $p(\mathbf{x} \mid \mathbf{I}^{\text{ref}}, \mathbf{P})$ to produce novel views $\mathbf{x} = \{\mathbf{x}^{0}, \dots, \mathbf{x}^{L-1}\}$ based on point cloud renders $\mathbf{P}$ and reference image(s) $\mathbf{I}^{\text{ref}}$, where $\mathbf{x}$ denotes the video frames. The architecture inherits the Latent Diffusion Model (LDM) design, including a VAE encoder $\mathcal{E}$ and decoder $\mathcal{D}$ for image compression, a video denoising U-Net with spatial and temporal layers, and a CLIP image encoder for reference image understanding. Point cloud renders are encoded with $\mathcal{E}$ and concatenated with the noise along the channel dimension inside the video denoising U-Net (see the conditioning sketch after this list).
  • Iterative View Synthesis: To address the challenges of generating long videos, an iterative view synthesis strategy is employed, coupled with a content-adaptive camera trajectory planning algorithm. The camera is navigated from one of the reference views to a target camera pose to reveal occlusions and missing regions of the current point cloud. Novel views are generated using ViewCrafter, and the generated views are back-projected to complete the point cloud.
  • Camera Trajectory Planning: A Next-Best-View (NBV)-based camera trajectory planning algorithm generates adaptive camera trajectories tailored to different scene types. Starting from the input reference image(s) $\mathcal{I}_{\text{ref}}$, an initial point cloud $\mathcal{P}_{\text{ref}}$ is constructed, and the camera trajectory is initialized from one of the reference camera poses $\mathcal{C}_{\text{ref}}$. Candidate camera poses $\mathcal{C}_{\text{can}} = \{\mathcal{C}^{1}_{\text{can}}, \dots, \mathcal{C}^{K}_{\text{can}}\}$ are sampled from the search space surrounding the current camera pose $\mathcal{C}_{\text{curr}} = \mathcal{C}_{\text{ref}}$, and a set of candidate masks $\mathcal{M}_{\text{can}}$ is rendered from the current point cloud $\mathcal{P}_{\text{curr}}$. A utility function $\mathcal{F}(\cdot)$ then determines the optimal camera pose for the next step:
    $$\mathcal{F}(\mathcal{C}) = \begin{cases} \dfrac{\operatorname{sum}(\mathcal{M}_{\mathcal{C}})}{W \times H}, & \dfrac{\operatorname{sum}(\mathcal{M}_{\mathcal{C}})}{W \times H} < \Theta, \\ 1 - \dfrac{\operatorname{sum}(\mathcal{M}_{\mathcal{C}})}{W \times H}, & \dfrac{\operatorname{sum}(\mathcal{M}_{\mathcal{C}})}{W \times H} \geq \Theta, \end{cases}$$
    where $\mathcal{C} \in \mathcal{C}_{\text{can}}$, $\mathcal{M}_{\mathcal{C}} \in \mathcal{M}_{\text{can}}$, and $\operatorname{sum}(\mathcal{M}_{\mathcal{C}}) = \sum_{u=0}^{W} \sum_{v=0}^{H} \mathcal{M}_{\mathcal{C}}(u,v)$ (a sketch of this selection step also follows the list).
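
As a concrete picture of the point cloud step, here is a minimal sketch assuming per-pixel point maps (and any needed global alignment) have already been produced by a DUSt3R-style dense stereo model; the function name and array layouts are illustrative, not the paper's code.

```python
import numpy as np

def build_colored_point_cloud(point_maps, images, valid_masks=None):
    """Fuse per-view point maps with their RGB images into one colored point cloud.

    point_maps:  list of (H, W, 3) arrays of per-pixel 3D points in a shared frame
                 (e.g. after global point map alignment for multiple inputs)
    images:      list of (H, W, 3) RGB arrays aligned with the point maps
    valid_masks: optional list of (H, W) boolean arrays marking reliable pixels
    """
    points, colors = [], []
    for i, (pm, img) in enumerate(zip(point_maps, images)):
        mask = valid_masks[i] if valid_masks is not None else np.ones(pm.shape[:2], bool)
        points.append(pm[mask])   # (N_i, 3) 3D positions
        colors.append(img[mask])  # (N_i, 3) per-point colors
    return np.concatenate(points), np.concatenate(colors)
```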
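
The conditioning scheme of the point-conditioned diffusion model can be pictured with a short PyTorch sketch. This is not the released implementation: the callables (`vae_encoder`, `denoising_unet`) and tensor shapes are assumptions for illustration. The point it makes is that point cloud renders are VAE-encoded and concatenated with the noisy video latents along the channel dimension before entering the video denoising U-Net.

```python
import torch

def conditioned_denoising_step(noisy_latents, point_cloud_renders, ref_embedding,
                               timestep, vae_encoder, denoising_unet):
    """One denoising step of a point-conditioned video diffusion model (illustrative).

    noisy_latents:       (B, L, C, h, w)  noisy video latents at the current timestep
    point_cloud_renders: (B, L, 3, H, W)  frames rendered from the current point cloud
    ref_embedding:       (B, N, D)        CLIP embedding of the reference image(s)
    """
    B, L = point_cloud_renders.shape[:2]

    # Encode each point cloud render into latent space with the (frozen) VAE encoder.
    renders_flat = point_cloud_renders.flatten(0, 1)            # (B*L, 3, H, W)
    render_latents = vae_encoder(renders_flat)                  # (B*L, C, h, w)
    render_latents = render_latents.view(B, L, *render_latents.shape[1:])

    # Concatenate the condition latents with the noisy latents along the channel axis,
    # so the U-Net sees the coarse 3D clues for every frame it denoises.
    unet_input = torch.cat([noisy_latents, render_latents], dim=2)  # (B, L, 2C, h, w)

    # The video U-Net predicts the noise, cross-attending to the reference embedding.
    noise_pred = denoising_unet(unet_input, timestep, context=ref_embedding)
    return noise_pred
```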
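
The trajectory planning utility can be read off the formula directly. The sketch below is one plausible reading, not the authors' code: each candidate mask is assumed to be a binary H×W array marking regions of the candidate view that the current point cloud does not cover, and the default value of `theta` is an illustrative assumption for the threshold Θ.

```python
import numpy as np

def utility(mask, theta):
    """Utility of a candidate pose given its missing-region mask of shape (H, W)."""
    H, W = mask.shape
    ratio = mask.sum() / (W * H)   # fraction of the candidate view that is missing
    # Prefer views that reveal some missing content, but penalize views with so much
    # missing content that the diffusion model has too little context to complete them.
    return ratio if ratio < theta else 1.0 - ratio

def next_best_view(candidate_poses, candidate_masks, theta=0.6):
    """Pick the candidate pose maximizing the utility (one step of trajectory planning).

    candidate_poses: list of K camera poses sampled around the current pose
    candidate_masks: list of K binary masks rendered from the current point cloud
    theta:           threshold Theta (the default here is an assumption, not from the paper)
    """
    scores = [utility(m, theta) for m in candidate_masks]
    return candidate_poses[int(np.argmax(scores))]
```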

The paper also explores applications of ViewCrafter, including efficient optimization of a 3D-GS representation for real-time rendering and scene-level text-to-3D generation. For 3D-GS optimization, the iterative view synthesis strategy first completes the initial point cloud and synthesizes novel views; the attributes of each 3D Gaussian are then optimized under the supervision of the synthesized novel views, dropping the densification, splitting, and opacity-reset tricks and cutting the schedule to 2,000 iterations (a schematic of this simplified loop is sketched below). For text-to-3D generation, a text-to-image diffusion model first generates a reference image from a text prompt, after which ViewCrafter performs novel view synthesis and 3D reconstruction.
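
The simplified 3D-GS optimization amounts to a fixed-budget training loop. The sketch below is schematic: it assumes a differentiable splatting renderer is available as the hypothetical callable `render_gaussians`, and the loss shown is plain L1 photometric supervision, whereas 3D-GS pipelines typically add a D-SSIM term. What it illustrates is that Gaussians initialized from the completed point cloud are optimized against the synthesized views for a fixed 2,000 iterations, with no densification, splitting, or opacity resets interleaved.

```python
import torch

def optimize_gaussians(gaussian_params, train_views, render_gaussians,
                       num_iters=2000, lr=1e-3):
    """Fixed-budget 3D-GS optimization (schematic, not the released implementation).

    gaussian_params:  dict of tensors (positions, scales, rotations, opacities, colors)
                      initialized from the completed point cloud
    train_views:      list of (camera, target_image) pairs; the targets are the reference
                      images plus the novel views synthesized by ViewCrafter
    render_gaussians: differentiable splatting renderer (hypothetical callable)
    """
    params = [p.requires_grad_(True) for p in gaussian_params.values()]
    optimizer = torch.optim.Adam(params, lr=lr)

    for it in range(num_iters):
        camera, target = train_views[it % len(train_views)]
        rendered = render_gaussians(gaussian_params, camera)

        # Plain photometric supervision; no densification, splitting, or opacity-reset
        # steps are interleaved, which keeps the schedule down to 2,000 iterations.
        loss = (rendered - target).abs().mean()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return gaussian_params
```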

The method was evaluated on the Tanks-and-Temples, RealEstate10K, and CO3D datasets. In zero-shot novel view synthesis, ViewCrafter outperformed the baselines on both image quality and pose accuracy metrics, demonstrating its ability to synthesize high-fidelity novel views with precise pose control. In 3D-GS reconstruction, it surpassed previous state-of-the-art methods, validating its effectiveness for scene reconstruction from sparse views. On the Tanks-and-Temples dataset, ViewCrafter achieved LPIPS 0.194, PSNR 21.26, SSIM 0.655, and FID 27.18 on the "easy" set and LPIPS 0.283, PSNR 18.07, SSIM 0.563, and FID 38.92 on the "hard" set (lower LPIPS and FID, and higher PSNR and SSIM, are better). In comparison, the next-best method, MotionCtrl, achieved LPIPS 0.400, PSNR 15.34, SSIM 0.427, and FID 70.3 on the "easy" set and LPIPS 0.473, PSNR 13.29, SSIM 0.384, and FID 196.8 on the "hard" set.

Ablation studies compared the point cloud-based pose condition strategy with Plücker coordinates, demonstrating that ViewCrafter achieves more accurate pose control. The studies also confirmed the robustness of ViewCrafter to imperfect point cloud conditions and validated the effectiveness of the training paradigm and the camera trajectory planning algorithm.
