Relightful Harmonization: Lighting-aware Portrait Background Replacement (2312.06886v2)

Published 11 Dec 2023 in cs.CV

Abstract: Portrait harmonization aims to composite a subject into a new background, adjusting its lighting and color to ensure harmony with the background scene. Existing harmonization techniques often focus only on adjusting the global color and brightness of the foreground and ignore crucial illumination cues from the background, such as the apparent lighting direction, leading to unrealistic compositions. We introduce Relightful Harmonization, a lighting-aware diffusion model designed to seamlessly harmonize sophisticated lighting effects for the foreground portrait using any background image. Our approach unfolds in three stages. First, we introduce a lighting representation module that allows our diffusion model to encode lighting information from the target image's background. Second, we introduce an alignment network that aligns lighting features learned from the image background with lighting features learned from panoramic environment maps, which are a complete representation of scene illumination. Last, to further boost the photorealism of the proposed method, we introduce a novel data simulation pipeline that generates synthetic training pairs from a diverse range of natural images, which are used to refine the model. Our method outperforms existing benchmarks in visual fidelity and lighting coherence and shows superior generalization in real-world testing scenarios, highlighting its versatility and practicality.


Summary

  • The paper integrates explicit lighting conditioning into a diffusion model so that it captures environmental lighting and produces more realistic foreground-background composites.
  • It employs a novel alignment network that calibrates lighting features extracted from background images against those from environment maps, improving their physical plausibility.
  • The method leverages a data synthesis pipeline for finetuning and outperforms existing techniques on metrics such as MSE, PSNR, SSIM, and LPIPS.

Relightful Harmonization: Lighting-aware Portrait Background Replacement

The paper "Relightful Harmonization: Lighting-aware Portrait Background Replacement" presents a novel approach to compositing foreground subjects into new background images while maintaining realistic harmonization in terms of lighting and color. This technique aims to improve upon the limitations of existing harmonization and relighting methods by incorporating sophisticated lighting effects, ensuring visual fidelity and coherence in the final composite images. The proposed method leverages a lighting-aware diffusion model framework bolstered by novel training and alignment techniques, ultimately demonstrating superior performance across various testing scenarios.

Methodology Overview

The authors' methodology can be categorized into three primary stages:

  1. Lighting-aware Diffusion Training:
    • The first stage integrates a lighting representation module within a pre-trained diffusion model. This enables the model to encode lighting information from the background image.
    • The model is trained on a paired light-stage dataset designed specifically for relighting: images of subjects captured under various lighting conditions, together with the corresponding environment maps.
  2. Lighting Representation Alignment:
    • In this stage, the model enhances the physical plausibility of the lighting by aligning the learned lighting representation from background images with that derived from environment maps.
    • An additional alignment network calibrates the background-extracted lighting features to match those extracted from the environment maps, yielding more accurate and realistic lighting effects (a minimal sketch of this conditioning and alignment follows the list).
  3. Finetuning for Photorealism:
    • The final stage focuses on improving the photorealism of the model's output. A novel data synthesis pipeline is introduced to generate high-quality training pairs from natural images.
    • The model is finetuned using this expanded dataset, allowing it to generalize better to real-world scenarios and improving its ability to produce visually coherent harmonized images.
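
Since the summary above describes the architecture only at a high level, the following minimal PyTorch sketch shows one way stages 1 and 2 could be wired together. The module names (`LightingEncoder`, `AlignmentNet`), the layer shapes, and the L2 alignment objective are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of Stages 1-2 under assumed design choices; this is
# not the authors' code. Lighting is summarized into a compact feature
# that conditions the diffusion model, and an alignment network pulls
# background-derived features toward environment-map-derived ones.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightingEncoder(nn.Module):
    """Maps an image (background or environment map) to a lighting feature."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1),                      # (B, 256, 1, 1)
        )
        self.proj = nn.Linear(256, feat_dim)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        return self.proj(self.backbone(img).flatten(1))   # (B, feat_dim)

class AlignmentNet(nn.Module):
    """Calibrates background lighting features toward env-map features."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.SiLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, bg_feat: torch.Tensor) -> torch.Tensor:
        return self.mlp(bg_feat)

def alignment_loss(aligned_bg_feat, env_feat):
    # One plausible Stage-2 objective: match the calibrated background
    # feature to the (frozen) environment-map feature.
    return F.mse_loss(aligned_bg_feat, env_feat.detach())

# In Stage 1 the lighting feature would condition the diffusion U-Net,
# e.g. appended as an extra cross-attention context token:
#   ctx = torch.cat([text_ctx, light_feat[:, None, :]], dim=1)
#   eps = unet(x_t, t, context=ctx)
```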

Contributions and Numerical Results

The paper's primary contributions include:

  • Integrating explicit lighting conditioning into a pre-trained diffusion model, enabling it to capture and utilize spatial lighting information from the background.
  • Introducing an alignment network that enhances the physical plausibility of the learned lighting representations by aligning them with environment-map-derived features.
  • Developing a data synthesis pipeline that generates realistic training pairs from natural images, allowing the model to be finetuned for improved photorealism (one possible shape of such a pipeline is sketched below).
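
To make the third contribution concrete, here is one plausible shape for a single pair-synthesis step. `matting_fn` and `relight_fn` are hypothetical stand-ins for pretrained matting and relighting models; the paper's actual pipeline may differ in its details.

```python
# Hypothetical pair-synthesis step (Stage 3 finetuning data); the
# matting and relighting models are assumed to exist as callables.
from typing import Callable, Tuple
import torch

def synthesize_pair(
    image: torch.Tensor,       # natural photo of a subject, (3, H, W) in [0, 1]
    background: torch.Tensor,  # new background, (3, H, W) in [0, 1]
    matting_fn: Callable[[torch.Tensor], torch.Tensor],  # -> alpha (1, H, W)
    relight_fn: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Builds an (input composite, harmonized target) training pair."""
    alpha = matting_fn(image)
    # Naive cut-and-paste composite with mismatched lighting: model input.
    composite = alpha * image + (1 - alpha) * background
    # Subject relit to be consistent with the new background: target.
    relit = relight_fn(image, background)
    target = alpha * relit + (1 - alpha) * background
    return composite, target
```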

The proposed method demonstrates significant improvements over existing methods on multiple metrics, including MSE, SSIM, PSNR, and LPIPS. Specifically, the model achieves:

  • On the light stage test set: MSE of 0.012, PSNR of 20.527, SSIM of 0.848, and LPIPS of 0.159
  • On the natural image test set: MSE of 0.005, PSNR of 23.562, SSIM of 0.913, and LPIPS of 0.097

These results highlight the method's ability to deliver more accurate and visually coherent harmonized images compared to existing harmonization and relighting techniques.
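
For reference, MSE, PSNR, SSIM, and LPIPS are standard full-reference image-quality metrics. The sketch below shows a typical way to compute them with scikit-image and the `lpips` package; this is an assumed evaluation setup, not the authors' script.

```python
# Assumed evaluation setup (not the paper's script). Requires
# scikit-image, torch, and the lpips package.
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='alex')  # perceptual distance of Zhang et al. (2018)

def evaluate(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred, gt: float arrays in [0, 1] with shape (H, W, 3)."""
    mse = float(np.mean((pred - gt) ** 2))
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() * 2 - 1
    with torch.no_grad():
        lp = float(lpips_fn(to_t(pred), to_t(gt)))
    return {"MSE": mse, "PSNR": psnr, "SSIM": ssim, "LPIPS": lp}
```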

Implications and Future Directions

Practical Implications:

This approach has substantial practical implications for photography, virtual reality, and creative image editing, enabling users to composite subjects into diverse backgrounds with consistent, realistic lighting and color. Because the final model does not require an environment map at inference time, it remains practical in casual photography settings and broadly applicable to real-world scenarios.

Theoretical Implications:

The introduction of lighting representation alignment between background-derived and environment-map-derived features offers a novel way to bridge the gap between the incomplete lighting cues available in in-the-wild images and the complete illumination captured in structured training data. This technique could inspire further research into aligning other kinds of learned representations to improve model performance and generalizability.

Future Developments:

Future research could explore higher-resolution training to overcome the current resolution limitation and better preserve fine details in the subjects. Integrating intermediate steps such as albedo estimation could further improve handling of complex lighting scenarios, potentially extending the method to more intricate compositional tasks. Extending the framework to dynamic and interactive lighting conditions in video sequences is another promising direction.

Conclusion

The paper introduces a robust and versatile framework for lighting-aware portrait background replacement. By combining a lighting-aware diffusion model with a novel lighting representation alignment technique and a comprehensive data synthesis pipeline, the authors demonstrate significant advancements in both the accuracy and realism of harmonized images. This research provides a solid foundation for further development and application of advanced harmonization techniques in both academic and practical fields.
