LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias (2410.17242v2)
Abstract: We propose the Large View Synthesis Model (LVSM), a novel transformer-based approach for scalable and generalizable novel view synthesis from sparse-view inputs. We introduce two architectures: (1) an encoder-decoder LVSM, which encodes input image tokens into a fixed number of 1D latent tokens, functioning as a fully learned scene representation, and decodes novel-view images from them; and (2) a decoder-only LVSM, which directly maps input images to novel-view outputs, completely eliminating intermediate scene representations. Both models bypass the 3D inductive biases used in previous methods -- from 3D representations (e.g., NeRF, 3DGS) to network designs (e.g., epipolar projections, plane sweeps) -- addressing novel view synthesis with a fully data-driven approach. While the encoder-decoder model offers faster inference due to its independent latent representation, the decoder-only LVSM achieves superior quality, scalability, and zero-shot generalization, outperforming previous state-of-the-art methods by 1.5 to 3.5 dB PSNR. Comprehensive evaluations across multiple datasets demonstrate that both LVSM variants achieve state-of-the-art novel view synthesis quality. Notably, our models surpass all previous methods even with reduced computational resources (1-2 GPUs). Please see our website for more details: https://haian-jin.github.io/projects/LVSM/ .
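The decoder-only token flow described above can be sketched in a few lines: posed input images and the target camera are both turned into patch tokens (pixels plus per-pixel ray embeddings for inputs, rays only for the target), concatenated into one sequence, passed through full self-attention with no epipolar or plane-sweep structure, and the target tokens are unpatchified into the novel view. This is a minimal NumPy sketch with untrained random weights, a single attention block, and illustrative sizes; the actual LVSM uses deep transformer stacks, Plücker ray embeddings, and learned per-patch linear layers, so all dimensions and the zero-padding of target tokens here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p):
    """Split an HxWxC image into (H/p * W/p) flattened patch tokens."""
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head full self-attention over the whole token sequence."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return scores @ v

# Illustrative sizes (hypothetical; far smaller than the paper's models).
H = W = 8; C = 3; p = 4; d = 16

input_img   = rng.random((H, W, C))   # one posed input view
input_rays  = rng.random((H, W, 6))   # per-pixel ray embedding of the input view
target_rays = rng.random((H, W, 6))   # rays of the novel view to render

# Tokenize: input tokens carry pixels + rays; target tokens carry rays only
# (image channels zero-padded here so both token types share one embedding).
in_tok  = patchify(np.concatenate([input_img, input_rays], axis=-1), p)
tgt_tok = patchify(np.concatenate([np.zeros_like(input_img), target_rays], axis=-1), p)

W_embed = rng.standard_normal((in_tok.shape[1], d)) * 0.1
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
W_out = rng.standard_normal((d, p * p * C)) * 0.1

# Decoder-only mapping: one joint sequence, full attention, no intermediate
# 3D scene representation anywhere.
tokens = np.concatenate([in_tok, tgt_tok], axis=0) @ W_embed
tokens = tokens + self_attention(tokens, Wq, Wk, Wv)  # one transformer block (sketch)

# Read off the target tokens and unpatchify them into the novel view.
n_tgt = tgt_tok.shape[0]
novel = (tokens[-n_tgt:] @ W_out).reshape(H // p, W // p, p, p, C)
novel = novel.transpose(0, 2, 1, 3, 4).reshape(H, W, C)
print(novel.shape)
```

The encoder-decoder variant differs only in inserting a fixed number of learned latent tokens between the input tokens and the output head, which is what makes its per-scene representation reusable across target views.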