P2I-NET: Mapping Camera Pose to Image via Adversarial Learning for New View Synthesis in Real Indoor Environments (2309.15526v1)
Abstract: Given a new 6-DoF camera pose in an indoor environment, we study the challenging problem of predicting the view from that pose based on a set of reference RGBD views. Existing explicit or implicit 3D geometry construction methods are computationally expensive, while learning-based methods have predominantly focused on isolated views of object categories with regular geometric structure. Departing from the traditional \textit{render-inpaint} approach to new view synthesis in real indoor environments, we propose a conditional generative adversarial network (P2I-NET) that directly predicts the new view from the given pose. P2I-NET learns the conditional distribution of the images of the environment to establish the correspondence between a camera pose and its view of the environment, and achieves this through a number of innovative designs in its architecture and training loss function. Two auxiliary discriminator constraints enforce consistency between the pose of the generated image and that of the corresponding real-world image, both in the latent feature space and in the real-world pose space. Additionally, a deep convolutional neural network (CNN) further reinforces this consistency in the pixel space. We have performed extensive new view synthesis experiments on real indoor datasets. Results show that P2I-NET outperforms a number of strong NeRF-based baseline models. In particular, P2I-NET is 40 to 100 times faster than these competing techniques while synthesising images of comparable quality. Furthermore, we contribute a new publicly available indoor environment dataset containing 22 high-resolution RGBD videos in which every frame also carries accurate camera pose parameters.
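The abstract describes a combined objective: an adversarial term, pose-consistency constraints (the discriminator recovers the pose from the generated image and it must match the conditioning pose), and a pixel-space consistency term. The toy sketch below illustrates how such a combined loss could be assembled. All networks are reduced to single linear layers, and all dimensions, weight names, and loss weights (`lam_pose`, `lam_pix`) are assumptions for illustration only, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 6-DoF pose vector mapped to a tiny RGB image.
POSE_DIM, H, W = 6, 8, 8

def generator(pose, Wg):
    """Toy generator: one linear layer mapping a pose vector to an image."""
    return np.tanh(Wg @ pose).reshape(H, W, 3)

def discriminator(img, Wd_adv, Wd_pose):
    """Toy two-headed discriminator: a realism score and a regressed pose."""
    x = img.ravel()
    adv = 1.0 / (1.0 + np.exp(-(Wd_adv @ x)))   # sigmoid realism score
    pose_hat = Wd_pose @ x                      # pose recovered from the image
    return adv, pose_hat

def p2i_loss(pose, real_img, Wg, Wd_adv, Wd_pose, lam_pose=1.0, lam_pix=1.0):
    """Illustrative generator objective in the spirit of P2I-NET:
    adversarial term + pose-space consistency + pixel-space consistency."""
    fake = generator(pose, Wg)
    adv, pose_hat = discriminator(fake, Wd_adv, Wd_pose)
    l_adv = -np.log(adv + 1e-8)                 # fool the discriminator
    l_pose = np.mean((pose_hat - pose) ** 2)    # generated view must imply the input pose
    l_pix = np.mean(np.abs(fake - real_img))    # match the ground-truth view pixel-wise
    return l_adv + lam_pose * l_pose + lam_pix * l_pix

pose = rng.normal(size=POSE_DIM)
real_img = rng.uniform(-1.0, 1.0, size=(H, W, 3))
Wg = rng.normal(scale=0.1, size=(H * W * 3, POSE_DIM))
Wd_adv = rng.normal(scale=0.01, size=H * W * 3)
Wd_pose = rng.normal(scale=0.01, size=(POSE_DIM, H * W * 3))

loss = p2i_loss(pose, real_img, Wg, Wd_adv, Wd_pose)
print(loss)
```

In the paper these terms are realised by deep networks and a latent-feature constraint as well; the sketch only shows how the three consistency signals combine into one scalar objective.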