- The paper introduces LightLab, a method that fine-tunes diffusion models for precise, parametric control over light sources in single images.
- It leverages a data-centric approach by combining a small set of real image pairs with synthetic renderings to learn accurate light transport and illumination effects.
- Experimental results, including quantitative metrics and user studies, demonstrate that LightLab outperforms other diffusion-based techniques in realistic lighting modifications.
Here is a summary of the paper "LightLab: Controlling Light Sources in Images with Diffusion Models" (arXiv:2505.09608):
The paper presents LightLab, a method for precise, parametric control over light sources within a single image using diffusion models. Unlike traditional 3D graphics methods that require complex inverse rendering from multiple views or existing diffusion approaches that rely on less precise text-based conditioning, LightLab focuses on learning the intricate relationship between lighting and image appearance directly from paired data.
The core idea is to fine-tune a pre-trained diffusion model on a dataset of images depicting controlled changes in lighting. Recognizing the difficulty of acquiring a large-scale dataset of real image pairs with precise light control, the authors propose a data-centric approach combining a small set of real raw photograph pairs with a larger set of synthetically rendered images.
The real dataset consists of pairs of photographs of the same scene, where the only difference is a visible light source being switched on or off. From these raw pairs, the illumination contribution of the target light source is isolated by subtracting the "off" image from the "on" image ($i_{\text{change}} = i_{\text{on}} - i_{\text{off}}$). A clipping operation is applied to handle noise and calibration errors that might result in negative values.
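For concreteness, a minimal NumPy sketch of this subtraction-and-clipping step could look as follows (the function name and array conventions are illustrative assumptions, not the paper's code):

```python
import numpy as np

def isolate_light_contribution(i_on: np.ndarray, i_off: np.ndarray) -> np.ndarray:
    """Recover a light source's contribution from a pair of linear RAW images.

    Both inputs are linear-RGB arrays of the same scene, differing only in
    whether the target light is switched on. Subtraction isolates the light's
    contribution; clipping removes small negative values caused by sensor
    noise and calibration error.
    """
    i_change = i_on.astype(np.float32) - i_off.astype(np.float32)
    return np.clip(i_change, 0.0, None)  # keep only non-negative light contributions
```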
The synthetic dataset is generated using a physically-based renderer on 3D indoor scenes. Crucially, the rendering pipeline allows rendering the contribution of each light component (ambient and individual light sources) separately in linear RGB space. This disentanglement in the synthetic domain provides a ground truth for light transport behavior.
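Because light transport is additive in linear RGB, the separately rendered components should sum back to the full render; a small consistency check along these lines (names and tolerance are illustrative) might be:

```python
import numpy as np

def check_light_linearity(i_full: np.ndarray,
                          i_ambient: np.ndarray,
                          i_lights: list[np.ndarray],
                          tol: float = 1e-3) -> bool:
    """Verify that per-light renders sum to the full render in linear RGB.

    i_full:    render with all light sources enabled (linear RGB)
    i_ambient: render with only ambient illumination
    i_lights:  one render per individual light source
    """
    reconstructed = i_ambient + sum(i_lights)
    return bool(np.allclose(i_full, reconstructed, atol=tol))
```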
Leveraging the linearity of light, both real and synthetic datasets are augmented to create numerous training examples. Given the disentangled ambient ($i_{\text{amb}}$) and target light ($i_{\text{change}}$) linear images, new relit images are synthesized by scaling these components and adding them together:

$$i_{\text{relit}}(\alpha, \gamma, c_t;\, i_{\text{amb}}, i_{\text{change}}) = \alpha\, i_{\text{amb}} + \gamma\, c\, i_{\text{change}}$$

where $\alpha$ is the relative ambient intensity, $\gamma$ is the relative target light intensity, and $c$ is a color change coefficient derived from the target color $c_t$. This process allows generating parametric sequences of images showing varying light intensities and colors.
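A minimal sketch of this augmentation, assuming the color coefficient $c$ is applied as a per-channel RGB scale (variable names are illustrative):

```python
import numpy as np

def relight(i_amb: np.ndarray,
            i_change: np.ndarray,
            alpha: float,
            gamma: float,
            color_coeff: np.ndarray) -> np.ndarray:
    """Synthesize a relit linear image from disentangled components.

    i_amb:       ambient-only linear image, shape (H, W, 3)
    i_change:    isolated contribution of the target light, shape (H, W, 3)
    alpha:       relative ambient intensity
    gamma:       relative target-light intensity
    color_coeff: per-channel coefficient c derived from the target color c_t, shape (3,)
    """
    c = color_coeff.reshape(1, 1, 3)  # broadcast the color over the image
    return alpha * i_amb + gamma * c * i_change
```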
Tone mapping from linear HDR light values to SDR images (required because diffusion models are trained on standard image data) is handled by generating training data with two strategies: tone mapping each relit image separately, and tone mapping whole sequences of relit images together using fixed exposures. Both strategies are used during training, and the chosen strategy is provided as a condition to the diffusion model, allowing users to control the output tone mapping at inference time.
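The paper's exact tone-mapping operator is not reproduced here; the sketch below uses a simple exposure-plus-gamma curve just to illustrate the difference between the two strategies (the mid-grey targeting heuristic is an assumption):

```python
import numpy as np

def tonemap(linear: np.ndarray, exposure: float) -> np.ndarray:
    """Map a linear HDR image to SDR with an exposure scale and gamma 2.2 (illustrative)."""
    return np.clip(linear * exposure, 0.0, 1.0) ** (1.0 / 2.2)

def tonemap_separately(sequence: list[np.ndarray], target_mean: float = 0.18) -> list[np.ndarray]:
    """Strategy 1: expose each relit image independently (per-image auto-exposure)."""
    return [tonemap(img, target_mean / max(img.mean(), 1e-6)) for img in sequence]

def tonemap_together(sequence: list[np.ndarray], target_mean: float = 0.18) -> list[np.ndarray]:
    """Strategy 2: use one fixed exposure for the whole sequence, so relative
    brightness changes between relit images are preserved in the SDR outputs."""
    exposure = target_mean / max(np.mean([img.mean() for img in sequence]), 1e-6)
    return [tonemap(img, exposure) for img in sequence]
```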
A pre-trained latent diffusion model, similar in architecture to Stable Diffusion-XL (Podell et al., 2023), is fine-tuned for the relighting task. The model is conditioned using a combination of spatial and global signals:
- Spatial Conditions: The input image (encoded via VAE), a depth map of the input image, and spatial masks representing the target light source. These masks are scaled by the desired light intensity change (for intensity control) and the target RGB color (for color control). These spatial conditions are resized to match the latent space dimensions, passed through learned convolutions, and concatenated with the input latent noise.
- Global Conditions: Scalar values representing the desired change in ambient light intensity and a binary value indicating the preferred tone mapping strategy ("separate" or "together"). These global controls are encoded using Fourier features and MLPs and inserted via cross-attention layers.
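For the global conditions, one plausible realization of the Fourier-feature-plus-MLP encoder is sketched below in PyTorch (layer sizes and the number of frequencies are assumptions, not the paper's values):

```python
import torch
import torch.nn as nn

class GlobalConditionEncoder(nn.Module):
    """Encode a scalar condition (e.g. ambient intensity change, or the binary
    tone-mapping flag) with Fourier features followed by an MLP, producing a
    token that can be injected through the model's cross-attention layers."""

    def __init__(self, num_freqs: int = 8, embed_dim: int = 1280):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs))
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_freqs, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, scalar: torch.Tensor) -> torch.Tensor:
        # scalar: (batch,) -> Fourier features: (batch, 2 * num_freqs)
        angles = scalar[:, None] * self.freqs[None, :]
        feats = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
        return self.mlp(feats)  # (batch, embed_dim) cross-attention token
```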
The authors show that training on a mixture of real and synthetic data yields superior results compared to using either domain alone. The real data helps the model generalize to photographic artifacts and styles, preventing drift toward the synthetic domain. The synthetic data, with its physically accurate light transport simulations, encourages the model to produce plausible shadows, reflections, and other lighting effects, even when the target light fixture geometry is not explicitly present in the input image. Quantitative metrics (PSNR, SSIM) on paired evaluation datasets and a user study comparing LightLab against other diffusion-based editing methods, namely OmniGen (Xiao et al., 2024), RGB↔X (2024), ScribbleLight (Choi et al., 2024), and IC-Light (2024), demonstrate that LightLab achieves significantly better physical plausibility and higher user preference for explicit light control.
LightLab enables several practical applications, including adjusting the intensity and color of visible light sources, modifying ambient illumination, performing sequential edits on multiple lights, inserting "virtual" light sources (rendered synthetically without geometry), and generating consistent lighting changes across image sequences for potential animation.
Limitations include a bias towards light source types seen in the training data, which can produce inaccurate results for novel light fixtures (e.g., a candle being lit as if it were a tube light). The data capture process, which relies on consumer devices, also limits the ability to perform relighting in physically calibrated units.
In conclusion, LightLab demonstrates that leveraging the linearity of light and combining real photographs with physics-based synthetic data is an effective strategy for training diffusion models to achieve fine-grained, parametric control over illumination in single images, resulting in high-quality and physically plausible relighting edits.