
Free3D: Consistent Novel View Synthesis without 3D Representation (2312.04551v2)

Published 7 Dec 2023 in cs.CV

Abstract: We introduce Free3D, a simple accurate method for monocular open-set novel view synthesis (NVS). Similar to Zero-1-to-3, we start from a pre-trained 2D image generator for generalization, and fine-tune it for NVS. Compared to other works that took a similar approach, we obtain significant improvements without resorting to an explicit 3D representation, which is slow and memory-consuming, and without training an additional network for 3D reconstruction. Our key contribution is to improve the way the target camera pose is encoded in the network, which we do by introducing a new ray conditioning normalization (RCN) layer. The latter injects pose information in the underlying 2D image generator by telling each pixel its viewing direction. We further improve multi-view consistency by using light-weight multi-view attention layers and by sharing generation noise between the different views. We train Free3D on the Objaverse dataset and demonstrate excellent generalization to new categories in new datasets, including OmniObject3D and GSO. The project page is available at https://chuanxiaz.com/free3d/.

References (90)
  1. Single lens stereo with a plenoptic camera. IEEE transactions on pattern analysis and machine intelligence (TPAMI), 14(2):99–106, 1992.
  2. The plenoptic function and the elements of early vision. Computational models of visual processing, 1(2):3–20, 1991.
  3. Renderdiffusion: Image diffusion for 3d reconstruction, inpainting and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12608–12618, 2023.
  4. Sequential modeling enables scalable learning for large vision models. arXiv preprint arXiv:2312.00785, 2023.
  5. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5855–5864, 2021.
  6. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5470–5479, 2022.
  7. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a.
  8. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22563–22575, 2023b.
  9. GeNVS: Generative novel view synthesis with 3D-aware diffusion models. In Proceedings of the International Conference on Computer Vision (ICCV), 2023.
  10. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11315–11325, 2022.
  11. Tensorf: Tensorial radiance fields. In European Conference on Computer Vision (ECCV), pages 333–350. Springer, 2022.
  12. Ray conditioning: Trading photo-realism for photo-consistency in multi-view image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023a.
  13. Generative pretraining from pixels. In International conference on machine learning (ICML), pages 1691–1703. PMLR, 2020.
  14. View interpolation for image synthesis. In Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), page 279–288, New York, NY, USA, 1993. Association for Computing Machinery.
  15. Explicit correspondence matching for generalizable neural radiance fields. arXiv preprint arXiv:2304.12294, 2023b.
  16. Modeling and rendering architecture from photographs: A hybrid geometry-and image-based approach. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 1996.
  17. Objaverse-xl: A universe of 10M+ 3d objects. Advances in Neural Information Processing Systems (NeurIPS), 2023a.
  18. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13142–13153, 2023b.
  19. Diffusion models beat gans on image synthesis. Advances in neural information processing systems (NeurIPS), 34:8780–8794, 2021.
  20. Google scanned objects: A high-quality dataset of 3d scanned household items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560. IEEE, 2022.
  21. A learned representation for artistic style. In International Conference on Learning Representations (ICLR), 2016.
  22. Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis. Advances in neural information processing systems (NeurIPS), 34:3518–3532, 2021a.
  23. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 12873–12883, 2021b.
  24. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5501–5510, 2022.
  25. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  26. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
  27. Multiple view geometry in computer vision. Cambridge university press, 2003.
  28. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), pages 6626–6637, 2017.
  29. Denoising diffusion probabilistic models. Advances in neural information processing systems (NeurIPS), 33:6840–6851, 2020.
  30. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
  31. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE international conference on computer vision (ICCV), pages 1501–1510, 2017.
  32. Planes vs. chairs: Category-guided 3d shape learning without any 3d cues. In European Conference on Computer Vision, pages 727–744. Springer, 2022.
  33. Efficient-3dim: Learning a generalizable single-image novel-view synthesizer in one day. arXiv preprint arXiv:2310.03015, 2023.
  34. Learning category-specific mesh reconstruction from image collections. In Proceedings of the European Conference on Computer Vision (ECCV), pages 371–386, 2018.
  35. invs: Repurposing diffusion inpainters for novel view synthesis. arXiv preprint arXiv:2310.16167, 2023.
  36. Holodiffusion: Training a 3d diffusion model using 2d images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18423–18433, 2023.
  37. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
  38. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (ToG), 42(4):1–14, 2023.
  39. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
  40. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
  41. One-2-3-45++: Fast single image to 3d objects with consistent multi-view generation and 3d diffusion. arXiv preprint arXiv:2311.07885, 2023a.
  42. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. arXiv preprint arXiv:2306.16928, 2023b.
  43. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9298–9309, 2023c.
  44. Syncdreamer: Learning to generate multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023d.
  45. Neural volumes: learning dynamic renderable volumes from images. ACM Transactions on Graphics (TOG), 38(4):1–14, 2019.
  46. Realfusion: 360° reconstruction of any object from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8446–8455, 2023a.
  47. Pc2: Projection-conditioned point cloud diffusion for single-image 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12923–12932, 2023b.
  48. Nerf: Representing scenes as neural radiance fields for view synthesis. In Proceedings of the European conference on computer vision (ECCV), 2020.
  49. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG), 41(4):1–15, 2022.
  50. Transformation-grounded image generation network for novel 3d view synthesis. In Proceedings of the ieee conference on computer vision and pattern recognition (CVPR), pages 3500–3509, 2017.
  51. Deepsdf: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 165–174, 2019.
  52. Dreamfusion: Text-to-3d using 2d diffusion. In The Eleventh International Conference on Learning Representations (ICLR), 2023.
  53. Conrad: Image constrained radiance fields for 3d generation from a single image. Advances in Neural Information Processing Systems (NeurIPS), 2023.
  54. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. arXiv preprint arXiv:2306.17843, 2023.
  55. Learning transferable visual models from natural language supervision. In International conference on machine learning (ICML), pages 8748–8763. PMLR, 2021.
  56. Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems (NeurIPS), 32, 2019.
  57. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 10684–10695, 2022.
  58. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems (NeurIPS), 35:36479–36494, 2022.
  59. Object scene representation transformer. Advances in Neural Information Processing Systems (NeurIPS), 35:9512–9524, 2022a.
  60. Scene representation transformer: Geometry-free novel view synthesis through set-latent scene representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6229–6238, 2022b.
  61. Zeronvs: Zero-shot 360-degree view synthesis from a single real image. arXiv preprint arXiv:2310.17994, 2023.
  62. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems (NeurIPS), 35:25278–25294, 2022.
  63. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023.
  64. Make-a-video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations (ICLR), 2022.
  65. Scene representation networks: Continuous 3d-structure-aware neural scene representations. Advances in Neural Information Processing Systems (NeurIPS), 32, 2019.
  66. Light field networks: Neural scene representations with single-evaluation rendering. Advances in Neural Information Processing Systems (NeurIPS), 34:19313–19325, 2021.
  67. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning (ICML), pages 2256–2265. PMLR, 2015.
  68. Light field neural rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8269–8279, 2022.
  69. Viewset diffusion: (0-)image-conditioned 3D generative models from 2D data. In Proceedings of the International Conference on Computer Vision (ICCV), 2023.
  70. Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior. arXiv preprint arXiv:2303.14184, 2023.
  71. Multi-view 3d models from single images with a convolutional network. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VII 14, pages 322–337. Springer, 2016.
  72. Consistent view synthesis with pose-guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16773–16783, 2023.
  73. Conditional image generation with pixelcnn decoders. Advances in neural information processing systems (NeurIPS), 29, 2016.
  74. Pixel recurrent neural networks. In International conference on machine learning (ICML), pages 1747–1756. PMLR, 2016.
  75. Neural discrete representation learning. Advances in neural information processing systems (NeurIPS), 30, 2017.
  76. Novel view synthesis with diffusion models. In The Eleventh International Conference on Learning Representations (ICLR), 2023.
  77. Consistent123: Improve consistency for one image to 3d object synthesis. arXiv preprint arXiv:2310.08092, 2023.
  78. Synsin: End-to-end view synthesis from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7467–7477, 2020.
  79. Uncovering the disentanglement capability in text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1900–1910, 2023a.
  80. Lamp: Learn a motion pattern for few-shot-based video generation. arXiv preprint arXiv:2310.10769, 2023b.
  81. Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 803–814, 2023c.
  82. Light field diffusion for single-view novel view synthesis. arXiv preprint arXiv:2309.11525, 2023.
  83. Consistnet: Enforcing 3d consistency for multi-view images diffusion. arXiv preprint arXiv:2310.10343, 2023.
  84. Consistent-1-to-3: Consistent image to 3d view synthesis via geometry-aware diffusion models. In Proceedings of the International Conference on 3D Vision (3DV), 2024.
  85. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4578–4587, 2021.
  86. Magvit: Masked generative video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10459–10469, 2023.
  87. Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020.
  88. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 586–595, 2018.
  89. Movq: Modulating quantized vectors for high-fidelity image generation. Advances in Neural Information Processing Systems (NeurIPS), 35:23412–23425, 2022.
  90. Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12588–12597, 2023.
Authors (2)
  1. Chuanxia Zheng (32 papers)
  2. Andrea Vedaldi (195 papers)
Citations (30)

Summary

Understanding Free3D: Novel Approach for View Synthesis

Novel View Synthesis and Traditional Challenges

Synthesizing new viewpoints of an object from a single image, known as novel view synthesis (NVS), is a long-standing challenge in computer vision. Traditionally, high-quality NVS has relied on explicit 3D representations, which are computationally expensive, memory-intensive, and generalize poorly to unseen data. Existing models also struggle to maintain accuracy and consistency across multiple generated viewpoints.

Introducing Free3D

Researchers at the University of Oxford have introduced Free3D, an approach that synthesizes consistent novel views without relying on an explicit 3D representation. The method paves the way for generating accurate, consistent 360° renderings of an object efficiently.

The Mechanisms Behind Free3D

The core strength of Free3D lies in how it encodes the target camera pose: a novel Ray Conditioning Normalization (RCN) layer tells each pixel its viewing direction, injecting pose information directly into the pre-trained 2D image generator. The method further improves multi-view consistency with lightweight multi-view attention layers and by sharing generation noise across views, all without an explicit 3D representation.
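To make the idea concrete, the conditioning step described above can be sketched as a single normalization operation. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the function name, the instance-norm-style statistics, and the learned projections `w_scale`/`w_shift` are hypothetical, and in Free3D the ray embedding and modulation live inside the diffusion U-Net.

```python
import numpy as np

def ray_condition_norm(features, ray_dirs, w_scale, w_shift, eps=1e-5):
    """Sketch of a ray-conditioning normalization (RCN) layer.

    features: (C, H, W) feature map from the 2D generator.
    ray_dirs: (3, H, W) unit viewing direction for each pixel,
              derived from the target camera pose.
    w_scale, w_shift: (C, 3) hypothetical learned projections that
              map each pixel's ray direction to a per-channel
              scale and shift.
    """
    # Normalize each channel (instance-norm style statistics).
    mean = features.mean(axis=(1, 2), keepdims=True)
    std = features.std(axis=(1, 2), keepdims=True)
    normed = (features - mean) / (std + eps)

    # Per-pixel modulation predicted from the viewing direction,
    # so every pixel "knows" the direction it is viewed from.
    scale = np.einsum('cd,dhw->chw', w_scale, ray_dirs)
    shift = np.einsum('cd,dhw->chw', w_shift, ray_dirs)
    return (1.0 + scale) * normed + shift

# Demo: zero projections reduce RCN to plain per-channel normalization.
feat = np.arange(8.0).reshape(2, 2, 2)
rays = np.ones((3, 2, 2)) / np.sqrt(3.0)
out = ray_condition_norm(feat, rays, np.zeros((2, 3)), np.zeros((2, 3)))
```

In the actual model the modulation parameters are learned end-to-end during fine-tuning, so the generator learns to interpret the per-pixel viewing directions as a camera pose.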

Free3D is trained on a single dataset, Objaverse, yet generalizes well to other datasets, including OmniObject3D and GSO. Extensive benchmarks show Free3D outperforming state-of-the-art models in both the accuracy and the consistency of the generated views.

The Contributions of Free3D

Free3D has made several noteworthy contributions to the field of NVS:

  • Precise Camera Pose Representation: By modifying pre-trained 2D generative models with ray conditioning normalization, Free3D ensures camera poses are represented accurately, enhancing the quality of novel viewpoints.
  • Enhanced Multi-view Consistency: Through multi-view attention mechanisms and noise sharing, Free3D maintains geometric and visual consistency across different views of the same object.
  • Superior Generalization: Free3D excels in adapting to entirely new datasets, demonstrating an outstanding ability to generalize beyond a single object or dataset category.
  • Efficiency and Simplicity: As a 3D-free method, Free3D simplifies the traditionally complex NVS pipeline while matching, and in many cases improving upon, the quality of 3D-representation-based alternatives.
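The noise-sharing idea in the list above can be sketched as follows. This is a hedged NumPy illustration, not the paper's exact scheme: the function `shared_noise_for_views` and the mixing parameter `rho` are hypothetical, and the sketch only shows how a shared Gaussian component correlates the initial diffusion noise across views.

```python
import numpy as np

def shared_noise_for_views(n_views, shape, rho=1.0, seed=0):
    """Sketch of multi-view noise sharing for diffusion sampling.

    Each view's initial noise mixes one shared Gaussian component
    with a per-view private component. rho=1.0 gives fully shared
    noise; rho=0.0 gives independent noise per view.
    """
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal(shape)
    noises = []
    for _ in range(n_views):
        private = rng.standard_normal(shape)
        # Mix so the result stays a unit-variance Gaussian.
        noises.append(np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private)
    return noises

# Demo: fully shared noise (rho=1) starts every view from the same sample.
views = shared_noise_for_views(3, (4, 4), rho=1.0, seed=42)
indep = shared_noise_for_views(2, (4, 4), rho=0.0, seed=42)
```

Starting all views from correlated noise biases the denoising trajectories toward consistent appearance, which complements the multi-view attention layers.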

Free3D establishes a new baseline for open-ended single-image novel view synthesis. With its ability to synthesize high-quality views without an explicit 3D model, it has the potential to open new doors in the application of NVS across various domains, including virtual reality, gaming, and 3D content creation. The simplicity, efficiency, and effective generalization capabilities of Free3D ensure its significance in advancing future research and applications in NVS.
