
Intrinsic Image Diffusion for Indoor Single-view Material Estimation (2312.12274v2)

Published 19 Dec 2023 in cs.CV, cs.AI, and cs.GR

Abstract: We present Intrinsic Image Diffusion, a generative model for appearance decomposition of indoor scenes. Given a single input view, we sample multiple possible material explanations represented as albedo, roughness, and metallic maps. Appearance decomposition poses a considerable challenge in computer vision due to the inherent ambiguity between lighting and material properties and the lack of real datasets. To address this issue, we advocate for a probabilistic formulation, where instead of attempting to directly predict the true material properties, we employ a conditional generative model to sample from the solution space. Furthermore, we show that utilizing the strong learned prior of recent diffusion models trained on large-scale real-world images can be adapted to material estimation and highly improves the generalization to real images. Our method produces significantly sharper, more consistent, and more detailed materials, outperforming state-of-the-art methods by $1.5dB$ on PSNR and by $45\%$ better FID score on albedo prediction. We demonstrate the effectiveness of our approach through experiments on both synthetic and real-world datasets.


Summary

  • The paper introduces a probabilistic generative model that samples intrinsic material maps via diffusion, estimating material properties with high fidelity.
  • It leverages a BRDF representation and over 50,000 rendered images to provide detailed and consistent predictions for indoor scenes.
  • Quantitative evaluations show superior performance with improved PSNR, SSIM, LPIPS, and FID scores compared to existing methods.

Intrinsic Image Diffusion for Material Estimation

Introduction to Appearance Decomposition

Appearance decomposition is a critical but challenging area in computer vision. It involves separating an image into its fundamental components: material properties and lighting. This process is essential for numerous applications, including content editing, virtual reality, and relighting of scenes. The main challenge lies in the fact that visual appearances result from the complex interplay between lighting and material properties, leading to inherent ambiguity in separating these components.

Probabilistic Approach to Estimation

Traditional methods have adopted a deterministic approach, aiming to provide a single solution, which often results in loss of high-frequency details and averaged-out solutions that fail to represent the true complexity of materials. The paper introduces Intrinsic Image Diffusion, a conditional generative model that embraces the probabilistic nature of the appearance decomposition problem. By generating multiple solutions, the model allows for a more comprehensive exploration of the solution space. This method leverages recent diffusion models, which have been pre-trained with large-scale real-world images to better generalize across real and synthetic data.
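The key idea, that a single view admits many valid material explanations and a conditional model should sample several of them, can be illustrated with a toy sampler. This is not the paper's actual network or sampler; `denoise` is a hypothetical stand-in for the conditional denoising model, and the blending schedule is a deliberately simplified sketch.

```python
import random

def sample_materials(cond, denoise, steps=50, n_samples=4, seed=0):
    """Draw several candidate material maps for one conditioning input.
    `denoise` stands in for the conditional network: given the current
    noisy map and the image features, it predicts the clean map."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in cond]   # start from pure noise
        for t in range(steps, 0, -1):
            x0_hat = denoise(x, cond)             # conditional prediction
            w = 1.0 / t                           # blend fraction this step
            x = [(1.0 - w) * xi + w * x0 for xi, x0 in zip(x, x0_hat)]
        samples.append(x)
    return samples
```

Because each chain starts from a different noise draw, the samples can disagree wherever the input is ambiguous, which is exactly the behavior a deterministic regressor averages away.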

Material Representation and Dataset

The material properties are represented with a microfacet BRDF model comprising albedo, roughness, and metallic maps, a parameterization standard in physically based rendering. The model was trained on a rendered dataset of over 50,000 images with corresponding ground-truth material maps, providing high-fidelity supervision. This dataset, coupled with the model's ability to adapt the image prior of pre-trained diffusion models, yields predictions that are more detailed, consistent, and faithful to the actual materials than those of existing approaches.
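To make the three material channels concrete, the sketch below evaluates a minimal Cook-Torrance/GGX BRDF from albedo, roughness, and metallic values. This is a generic physically based shading formula, not code from the paper, and the geometry terms are passed in as precomputed cosines for simplicity.

```python
import math

def ggx_brdf(albedo, roughness, metallic, n_dot_l, n_dot_v, n_dot_h, v_dot_h):
    """Minimal Cook-Torrance BRDF over the paper's three material channels.
    All direction-dependent inputs are cosines assumed to be in (0, 1]."""
    a2 = max(roughness, 1e-4) ** 4                 # alpha^2, alpha = roughness^2
    d = a2 / (math.pi * ((n_dot_h ** 2) * (a2 - 1.0) + 1.0) ** 2)  # GGX NDF
    k = (roughness + 1.0) ** 2 / 8.0               # Schlick-GGX visibility
    g = (n_dot_l / (n_dot_l * (1.0 - k) + k)) * (n_dot_v / (n_dot_v * (1.0 - k) + k))
    f0 = 0.04 * (1.0 - metallic) + albedo * metallic   # base reflectance
    f = f0 + (1.0 - f0) * (1.0 - v_dot_h) ** 5         # Schlick Fresnel
    spec = d * g * f / max(4.0 * n_dot_l * n_dot_v, 1e-6)
    diff = (1.0 - metallic) * albedo / math.pi         # Lambertian diffuse
    return diff + spec
```

Note how metallic interpolates both the base reflectance and the diffuse weight: a fully metallic surface has no Lambertian lobe, which is why the three maps together pin down the surface appearance.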

Methodology and Evaluation

The Intrinsic Image Diffusion model is trained to predict the noise added to the material maps during the diffusion process, conditioned on the input image, and it leverages the strong learned prior of pre-trained diffusion models. During inference, the model can sample multiple plausible explanations for a single input view, predicting albedo and BRDF features. The paper evaluates the model quantitatively and qualitatively on both synthetic and real-world datasets, showing that it outperforms state-of-the-art methods on PSNR, SSIM, LPIPS, and FID.
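The noise-prediction objective can be sketched as a single DDPM-style training step: corrupt the ground-truth material map at a random timestep and regress the injected noise. This is a generic illustration of the DDPM objective under a precomputed `alpha_bar` schedule; `eps_model` is a hypothetical stand-in for the conditional network.

```python
import math
import random

def ddpm_loss(x0, cond, eps_model, alpha_bar, rng):
    """One noise-prediction training step: x0 is the clean material map,
    cond the input-image conditioning, alpha_bar the cumulative noise
    schedule. Returns the mean squared error on the injected noise."""
    t = rng.randrange(len(alpha_bar))               # random timestep
    ab = alpha_bar[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]         # noise to inject
    x_t = [math.sqrt(ab) * xi + math.sqrt(1.0 - ab) * e
           for xi, e in zip(x0, eps)]               # forward diffusion
    eps_hat = eps_model(x_t, cond, t)               # predicted noise
    return sum((eh - e) ** 2 for eh, e in zip(eps_hat, eps)) / len(x0)
```

In the paper's setting the clean signal is the stack of material maps and the conditioning is the input view, so the same objective teaches the network to denoise materials consistently with the observed image.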

Furthermore, the paper uses the model's consistent and precise material predictions to optimize lighting in indoor scenes. This optimization recovers detailed, controllable lighting, improving the realism of the reconstructed scene.
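The lighting-optimization stage can be caricatured in one dimension: hold the sampled materials fixed and fit a light parameter so the re-rendered image matches the observation. The sketch below is deliberately minimal (a single scalar intensity and a diffuse-only rendering model), an assumption for illustration rather than the paper's differentiable renderer.

```python
def fit_light_intensity(albedo, observed, lr=0.1, steps=200):
    """Toy analogue of the lighting-optimization stage: with the sampled
    albedo held fixed, fit a scalar light intensity L so that albedo * L
    reproduces the observed pixels, by gradient descent on the MSE."""
    L = 1.0
    for _ in range(steps):
        grad = sum(2.0 * a * (a * L - o)
                   for a, o in zip(albedo, observed)) / len(albedo)
        L -= lr * grad
    return L
```

The point of the caricature is the division of labor: the better the material estimate, the more of the residual appearance can be safely attributed to lighting during this fit.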

Conclusion and Potential

The paper concludes by highlighting the Intrinsic Image Diffusion model's significant advancements in single-view material estimation. By using a probabilistic formulation and tapping into the learned priors of diffusion models, the technique opens up new possibilities for accurate and detailed material estimation. The approach also paves the way for future work, including weak supervision and expanded inverse rendering frameworks, making the field of appearance decomposition richer for new exploration and applications.
