
LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias (2410.17242v2)

Published 22 Oct 2024 in cs.CV, cs.GR, and cs.LG

Abstract: We propose the Large View Synthesis Model (LVSM), a novel transformer-based approach for scalable and generalizable novel view synthesis from sparse-view inputs. We introduce two architectures: (1) an encoder-decoder LVSM, which encodes input image tokens into a fixed number of 1D latent tokens, functioning as a fully learned scene representation, and decodes novel-view images from them; and (2) a decoder-only LVSM, which directly maps input images to novel-view outputs, completely eliminating intermediate scene representations. Both models bypass the 3D inductive biases used in previous methods -- from 3D representations (e.g., NeRF, 3DGS) to network designs (e.g., epipolar projections, plane sweeps) -- addressing novel view synthesis with a fully data-driven approach. While the encoder-decoder model offers faster inference due to its independent latent representation, the decoder-only LVSM achieves superior quality, scalability, and zero-shot generalization, outperforming previous state-of-the-art methods by 1.5 to 3.5 dB PSNR. Comprehensive evaluations across multiple datasets demonstrate that both LVSM variants achieve state-of-the-art novel view synthesis quality. Notably, our models surpass all previous methods even with reduced computational resources (1-2 GPUs). Please see our website for more details: https://haian-jin.github.io/projects/LVSM/.
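To make the decoder-only idea from the abstract concrete, the following is a minimal, hypothetical sketch: input-view patch tokens and target-view query tokens are concatenated into one sequence and processed by plain self-attention, with no epipolar projection, plane sweep, or intermediate 3D representation. All sizes, the patchify/unpatchify scheme, the zero-initialized target queries, and the parameter-free attention are illustrative assumptions, not the paper's actual architecture or trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p):
    """Split an (H, W, C) image into (H*W/p^2, p*p*C) patch tokens."""
    H, W, C = img.shape
    img = img.reshape(H // p, p, W // p, p, C)
    return img.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)

def unpatchify(tokens, H, W, C, p):
    """Inverse of patchify: rebuild an (H, W, C) image from patch tokens."""
    t = tokens.reshape(H // p, W // p, p, p, C)
    return t.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def self_attention(x):
    """One parameter-free self-attention pass (projection weights omitted)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ x

H = W = 8; C = 3; p = 4              # tiny illustrative image / patch sizes
input_view = rng.random((H, W, C))   # one input view (pose embedding omitted)
input_tokens = patchify(input_view, p)        # (4, 48) patch tokens
target_queries = np.zeros_like(input_tokens)  # learned queries in the paper

# Decoder-only: one token sequence, full attention, no 3D representation.
seq = np.concatenate([input_tokens, target_queries], axis=0)
out = self_attention(seq)

# Tokens at the target-query positions are decoded back into an image.
novel_view = unpatchify(out[len(input_tokens):], H, W, C, p)
print(novel_view.shape)  # (8, 8, 3)
```

In the actual model the attention layers carry learned projections and the tokens are conditioned on camera pose, but the data flow above captures why no scene representation is needed: the novel view is read directly off the transformer's output tokens.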


