RGB$\leftrightarrow$X: Image decomposition and synthesis using material- and lighting-aware diffusion models (2405.00666v1)
Abstract: The three areas of realistic forward rendering, per-pixel inverse rendering, and generative image synthesis may seem like separate and unrelated sub-fields of graphics and vision. However, recent work has demonstrated improved estimation of per-pixel intrinsic channels (albedo, roughness, metallicity) based on a diffusion architecture; we call this the RGB$\rightarrow$X problem. We further show that the reverse problem of synthesizing realistic images given intrinsic channels, X$\rightarrow$RGB, can also be addressed in a diffusion framework. Focusing on the image domain of interior scenes, we introduce an improved diffusion model for RGB$\rightarrow$X, which also estimates lighting, as well as the first diffusion X$\rightarrow$RGB model capable of synthesizing realistic images from (full or partial) intrinsic channels. Our X$\rightarrow$RGB model explores a middle ground between traditional rendering and generative models: we can specify only certain appearance properties that should be followed, and give freedom to the model to hallucinate a plausible version of the rest. This flexibility makes it possible to use a mix of heterogeneous training datasets, which differ in the available channels. We use multiple existing datasets and extend them with our own synthetic and real data, resulting in a model capable of extracting scene properties better than previous work and of generating highly realistic images of interior scenes.
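The abstract's key mechanism for X→RGB — conditioning on an arbitrary subset of intrinsic channels and letting the model hallucinate the missing ones — can be illustrated with a minimal sketch. This is an assumption about one plausible implementation, not the paper's actual interface: the `make_condition` helper, the zero-filled placeholders, and the per-channel availability mask are all hypothetical, standing in for however the model actually encodes partial conditioning.

```python
import numpy as np

def make_condition(channels, p_drop=0.3, rng=None, size=(8, 8)):
    """Build a partial conditioning signal from intrinsic channels.

    channels: dict mapping channel name (e.g. 'albedo', 'normal',
    'roughness') to an HxWx3 array, or None if the dataset lacks it.
    Channels are dropped either because they are unavailable (None) or
    at random with probability p_drop, which is how a model could learn
    to fill in unspecified properties. Returns (cond, mask): cond is the
    channel-wise concatenation with missing channels zeroed; mask flags
    which channels are actually present. All details are hypothetical.
    """
    rng = rng or np.random.default_rng()
    H, W = size  # toy resolution for the sketch
    cond, mask = [], []
    for name, img in channels.items():
        missing = img is None or rng.random() < p_drop
        if missing:
            img = np.zeros((H, W, 3))  # placeholder for a dropped channel
        cond.append(img)
        mask.append(0.0 if missing else 1.0)
    return np.concatenate(cond, axis=-1), np.array(mask)

# Example: albedo and normals given, roughness left to the model.
albedo = np.ones((8, 8, 3))
normal = np.full((8, 8, 3), 0.5)
cond, mask = make_condition(
    {"albedo": albedo, "normal": normal, "roughness": None}, p_drop=0.0
)
```

Randomly dropping channels at training time is also what makes heterogeneous datasets usable: a dataset missing, say, roughness simply contributes samples where that channel is always flagged absent.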