
RelightNet: Photorealistic Neural Relighting

Updated 2 December 2025
  • RelightNet is a family of neural networks that perform photorealistic relighting by decomposing scene structure and inferring shadow priors without relying on explicit geometric models.
  • It leverages methodologies like deep autoencoders with back-projection, transformer-based feature fusion, and wavelet decompositions to efficiently synthesize realistic lighting effects.
  • RelightNet variants are evaluated with metrics such as PSNR, SSIM, and LPIPS, demonstrating significant improvements in inverse rendering tasks across static images and dynamic human captures.

RelightNet is the designation for a family of neural networks dedicated to the photorealistic relighting of visual content—particularly single images or dynamic full-body human captures—under novel illumination and/or viewpoint conditions. These architectures target inverse rendering tasks by learning to synthesize realistic shadows, shading, and color shifts without explicit scene geometry, reflectance, or dense multi-light calibration. Several major RelightNet variants have been developed and evaluated in recent years, each with distinct design principles, data requirements, and output parameterizations (Wang et al., 2020, Singh et al., 29 Nov 2025, Puthussery et al., 2020).

1. Single-Image Deep RelightNet: Problem and Decomposition

The single-image relighting task aims to synthesize a version of a scene, given only an RGB input $X \in \mathbb{R}^{H \times W \times 3}$ captured under an unknown "any" illuminant $\Phi$, as if it were lit by a prescribed target source $\Psi$ (Wang et al., 2020). Under a Retinex-inspired decomposition,

$$X = L_\Phi(S), \qquad Y = L_\Psi(S)$$

where $S$ is the underlying light-invariant scene structure and $L_{\Phi,\Psi}(\cdot)$ are lighting operators. The key challenge is to perform this relighting without geometric or albedo priors.

Deep RelightNet addresses this by factoring the mapping into three learned subtasks:

  • Scene reconversion: Infer a shadow-free proxy $\hat{S} \approx S$ by removing input illumination effects.
  • Shadow (light-effect) prior estimation: Predict the patterns of shading and cast/attached shadows under a new illuminant.
  • Re-renderer: Fuse the estimated structure and new light effects to output the relit image.

Principal challenges include in-painting shadowed or anomalous regions without geometry, estimating global shadowing patterns for unseen light directions, and producing artifact-free composites (Wang et al., 2020).
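
The decomposition maps naturally onto a modular network, as in the following minimal PyTorch-style sketch; the submodule arguments (`scene_net`, `shadow_net`, `renderer`) and the simple channel-wise concatenation are illustrative assumptions, not the published implementation.

```python
import torch
import torch.nn as nn

class DeepRelightNetSketch(nn.Module):
    """Sketch of the three-part factorization: structure, shadow prior, re-rendering."""

    def __init__(self, scene_net: nn.Module, shadow_net: nn.Module, renderer: nn.Module):
        super().__init__()
        self.scene_net = scene_net    # infers a shadow-free proxy S_hat from the input X
        self.shadow_net = shadow_net  # predicts shading/shadow effects under the target light
        self.renderer = renderer      # fuses both streams into the relit image Y_hat

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s_hat = self.scene_net(x)        # scene reconversion (illumination removal)
        light_fx = self.shadow_net(x)    # shadow (light-effect) prior estimation
        fused = torch.cat([s_hat, light_fx], dim=1)
        return self.renderer(fused)      # re-rendering
```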

2. RelightNet Architectures: Canonical Designs

2.1 Deep Relighting Network (DRN, “Any-to-One”)

DRN (Wang et al., 2020) comprises three modules:

  • Scene reconversion module: An auto-encoder structured with four Down-sampling Back-Projection (DBP) and Up-sampling Back-Projection (UBP) blocks, nine ResNet-style residual blocks, and a conditional discriminator (for adversarial supervision). The DBP blocks implement

$$\bar{Z} = E_1(X), \quad \hat{X} = D_2(\bar{Z}), \quad R_X = X - \hat{X}, \quad R_Z = E_2(R_X), \quad \hat{Z} = \lambda_2\,\bar{Z} + R_Z - \lambda_1 E_2(X)$$

(with $\lambda_1 = \lambda_2 = 1$; a code sketch of this block follows the module list below).

  • Shadow prior estimation module: Identical backbone, but without skip connections. Incorporates two discriminators: D_global (PatchGAN on RGB) and D_shad (focused on shadow regions; it receives the clipped result $\operatorname{clip}(X, \alpha)$ with $\alpha = 15/255$).
  • Re-renderer: Concatenates structure and shadow prior features, applies a multi-scale convolution block (kernels 3/5/7/9), Squeeze-and-Excitation (SE) recalibration, and a final painting convolution followed by tanh activation.
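
The following is a minimal sketch of a single DBP block implementing the update rule above; the specific layer choices (strided convolutions for $E_1$, $E_2$, a transposed convolution for $D_2$) and channel widths are assumptions for illustration rather than the published configuration.

```python
import torch
import torch.nn as nn

class DownBackProjection(nn.Module):
    """Down-sampling back-projection (DBP) block, following the update rule above."""

    def __init__(self, in_ch: int, out_ch: int, lam1: float = 1.0, lam2: float = 1.0):
        super().__init__()
        # E1, E2: encoders that halve spatial resolution; D2: decoder that restores it.
        self.e1 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1), nn.PReLU())
        self.d2 = nn.Sequential(nn.ConvTranspose2d(out_ch, in_ch, 4, stride=2, padding=1), nn.PReLU())
        self.e2 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1), nn.PReLU())
        self.lam1, self.lam2 = lam1, lam2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z_bar = self.e1(x)        # Z_bar = E1(X)
        x_hat = self.d2(z_bar)    # X_hat = D2(Z_bar)
        r_x = x - x_hat           # R_X  = X - X_hat
        r_z = self.e2(r_x)        # R_Z  = E2(R_X)
        # Z_hat = lam2 * Z_bar + R_Z - lam1 * E2(X)
        return self.lam2 * z_bar + r_z - self.lam1 * self.e2(x)
```

A UBP block would, presumably, mirror this structure with the down- and up-sampling roles swapped.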

2.2 Transformer-Based RelightNet for Dynamic Humans

In “Relightable Holoported Characters,” a different RelightNet predicts full-body human appearance from multi-view input under arbitrary HDR lighting in a single network pass (Singh et al., 29 Nov 2025). Key features:

  • Physics-informed feature stack: For each mesh texel $u$ in the UV domain $\Omega$, the input is $f(u) = [\tilde n(u), \hat n(u), p(u), \hat{\rho}(u), d(u), \gamma(u)]$, where the features encode geometry (current and historical normals, position), albedo, precomputed shading, and viewing direction.
  • 2D U-Net backbone in UV space: Incorporates up to 18 blocks, with self- and cross-attention (SA/CA) at lower spatial resolutions. Cross-attention fuses texelwise features with a global lighting embedding derived from directional tokens of the target environment map; a condensed sketch of this fusion follows the list below.
  • Output: 14 parameters per texel, instantiating 3D Gaussian splats attached to the mesh surface for differentiable rendering and continuous animation.
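
Here is a condensed sketch of the cross-attention fusion between texel features and lighting tokens; the tensor shapes, the construction of the lighting tokens, and `d_model`/`n_heads` are assumptions, and in the actual network this operation sits inside the UV-space U-Net blocks.

```python
import torch
import torch.nn as nn

class LightingCrossAttention(nn.Module):
    """Fuse per-texel UV features (queries) with environment-map lighting tokens (keys/values)."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, texel_feats: torch.Tensor, light_tokens: torch.Tensor) -> torch.Tensor:
        # texel_feats:  (B, H*W, d_model) flattened low-resolution UV feature map
        # light_tokens: (B, L, d_model)   directional tokens embedding the target HDR map
        fused, _ = self.attn(query=texel_feats, key=light_tokens, value=light_tokens)
        return self.norm(texel_feats + fused)  # residual connection, as in standard transformers
```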

2.3 Wavelet Decomposed RelightNet (WDRN)

WDRN (Puthussery et al., 2020) is a one-to-one image relighting network employing a three-level encoder-decoder architecture built entirely around Haar Discrete Wavelet Transforms (DWT) and Inverse DWT (IDWT). Each downsampling "split" produces LL/LH/HL/HH wavelet bands, while decoding fuses these via IDWT and additive skip-connections. The method achieves a large receptive field without learned striding and robustly synthesizes large-scale luminance gradients.
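
Below is a minimal sketch of the Haar analysis/synthesis step that the encoder-decoder is built around; this direct tensor-slicing implementation and the band sign conventions are assumptions for illustration, and the learned convolutions between wavelet levels are omitted.

```python
import torch

def haar_dwt(x: torch.Tensor):
    """One-level 2D Haar DWT: (B, C, H, W) -> four (B, C, H/2, W/2) bands LL, LH, HL, HH."""
    a = x[:, :, 0::2, 0::2]  # top-left of each 2x2 block
    b = x[:, :, 0::2, 1::2]  # top-right
    c = x[:, :, 1::2, 0::2]  # bottom-left
    d = x[:, :, 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2
    lh = (-a - b + c + d) / 2
    hl = (-a + b - c + d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

def haar_idwt(ll, lh, hl, hh):
    """Inverse transform; exactly reconstructs the input of haar_dwt."""
    a = (ll - lh - hl + hh) / 2
    b = (ll - lh + hl - hh) / 2
    c = (ll + lh - hl - hh) / 2
    d = (ll + lh + hl + hh) / 2
    out = torch.zeros(ll.shape[0], ll.shape[1], 2 * ll.shape[2], 2 * ll.shape[3],
                      dtype=ll.dtype, device=ll.device)
    out[:, :, 0::2, 0::2] = a
    out[:, :, 0::2, 1::2] = b
    out[:, :, 1::2, 0::2] = c
    out[:, :, 1::2, 1::2] = d
    return out
```

Because the transform is exactly invertible, downsampling in the encoder discards no information, which is what lets WDRN enlarge its receptive field without learned striding.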

3. Training Protocols and Losses

RelightNet variants require extensive ground truth data and carefully staged training procedures.

  • DRN (Any-to-One):
  1. Scene reconversion is trained using L1 supervision and a strong adversarial term weighted $\lambda_{\mathrm{struc}} = 100$, leveraging pseudo shadow-free images from multi-exposure fusion.
  2. Shadow prior module targets both pixel-wise (L1) and shadow- or global-discriminative adversarial signals with no skip connections to enforce global shading.
  3. The re-renderer is trained last (freezing the previous modules), with L1 and perceptual (VGG) losses, the latter weighted $\lambda_{\mathrm{perc}} = 0.01$ (Wang et al., 2020); a sketch of this combined objective follows this list.
  • Transformers for Dynamic Humans:
    • Training is staged: animation networks first (on uniform-lit frames), then AlbedoNet on UV unprojections, finally RelightNet on environment-lit frames.
    • Supervision uses L1 and VGG-perceptual losses between composited output and ground truth, alongside strong regularizers on the Gaussian splat outputs to encourage stability during warm-up.
    • Datasets comprise tens of thousands of frames per subject, lit under more than 1000 HDR environment maps (Singh et al., 29 Nov 2025).
  • Wavelet Decomposed RelightNet:
    • The loss comprises L1, Structural Similarity (SSIM), and a novel gray loss targeting blurred single-channel gradients, jointly weighted with $\alpha = \beta = \gamma = 1$.
    • An ablation shows improved performance—especially in shadow realism—with the gray loss (Puthussery et al., 2020).
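
As referenced in the DRN item above, the following is a minimal sketch of the pixel-plus-perceptual objective used for the re-renderer stage, $\mathcal{L} = \mathcal{L}_1 + \lambda_{\mathrm{perc}}\,\mathcal{L}_{\mathrm{VGG}}$; the choice of VGG-19, the feature layer, and the omission of input normalization are assumptions for brevity.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class RelightLoss(nn.Module):
    """L1 plus weighted VGG-perceptual loss (sketch; the feature layer choice is assumed)."""

    def __init__(self, lambda_perc: float = 0.01, feature_layer: int = 35):
        super().__init__()
        vgg = vgg19(weights="IMAGENET1K_V1").features[:feature_layer].eval()
        for p in vgg.parameters():
            p.requires_grad_(False)  # frozen feature extractor
        self.vgg = vgg
        self.l1 = nn.L1Loss()
        self.lambda_perc = lambda_perc

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # note: ImageNet mean/std normalization of the VGG inputs is omitted for brevity
        pixel = self.l1(pred, target)
        perceptual = self.l1(self.vgg(pred), self.vgg(target))
        return pixel + self.lambda_perc * perceptual
```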

4. Datasets, Evaluation, and Quantitative Results

RelightNet architectures are evaluated on domain-specific benchmarks:

| Method | PSNR (↑) | SSIM (↑) | LPIPS (↓) |
| --- | --- | --- | --- |
| DRN (Wang et al., 2020) | 17.59 | 0.596 | 0.440 |
| WDRN (Puthussery et al., 2020) | 17.07 | 0.6310 | 0.3405 |
| RelightNet [2512...] | ~31.38 | 0.9000 | 0.0701 |

  • Single-image DRN/WDRN: Benchmarked on the VIDIT dataset (300 train / 45 val / 45 test scenes, 40 light direction/temperature combinations per scene, 1024×1024 images). DRN achieved the best PSNR in the AIM 2020 any-to-one challenge, while WDRN attains state-of-the-art mean perceptual scores with a minimal model size (6.4M parameters, 0.03 s/image).
  • Dynamic human RelightNet: Evaluated on lightstage datasets where ground-truth per-frame, per-view relighting is available. Outperforms Relighting4D, IntrinsicAvatar, NeuralGaffer, and MeshAvatar with higher PSNR, SSIM, and lower LPIPS in free-viewpoint and novel lighting conditions (Singh et al., 29 Nov 2025).
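
For concreteness, a small evaluation sketch computing the three reported metrics with commonly used packages (scikit-image for PSNR/SSIM and the `lpips` package for LPIPS); the [0, 1] data range and the AlexNet LPIPS backbone are assumptions and may differ from the protocols used in the cited papers.

```python
import numpy as np
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(pred: np.ndarray, gt: np.ndarray, lpips_model=None) -> dict:
    """pred, gt: float arrays in [0, 1] with shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    lpips_model = lpips_model or lpips.LPIPS(net="alex")
    # LPIPS expects NCHW tensors scaled to [-1, 1]
    def to_tensor(a):
        return torch.from_numpy(a).permute(2, 0, 1).unsqueeze(0).float() * 2 - 1
    with torch.no_grad():
        lp = lpips_model(to_tensor(pred), to_tensor(gt)).item()
    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lp}
```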

5. Technical Innovations

Several architectural and methodological advances characterize RelightNet developments:

  • Modular decomposition: Explicit separation of structure extraction, shadow prior estimation, and rendering.
  • Back-projection and wavelet blocks: Use of DBP/UBP for information preservation in DRN, and exact Haar DWT for receptive field expansion in WDRN.
  • Adversarial objectives focused on shadow regions: Tailored discriminators for shadow and global photorealism.
  • Cross-attention for lighting: Transformer-based fusion of per-surface features with lighting tokens directly mirrors the rendering integral and allows generalization to arbitrary lighting in a single forward pass (Singh et al., 29 Nov 2025).
  • Continuous 3D outputs: Gaussian splat parameterization enables seamless integration of relighting and animation pipelines for dynamic content.
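
As a purely illustrative note on the continuous 3D output head: one plausible way to account for the 14 per-texel parameters is a position offset (3), rotation quaternion (4), anisotropic scale (3), opacity (1), and RGB color (3). The split and activations below are assumptions, not the published parameterization.

```python
import torch
import torch.nn.functional as F

def decode_texel_gaussians(raw: torch.Tensor) -> dict:
    """raw: (..., 14) per-texel network output -> assumed 3D Gaussian splat attributes."""
    offset, quat, scale, opacity, color = raw.split([3, 4, 3, 1, 3], dim=-1)
    return {
        "offset": offset,                       # displacement from the mesh surface point
        "rotation": F.normalize(quat, dim=-1),  # unit quaternion
        "scale": torch.exp(scale),              # positive anisotropic scales
        "opacity": torch.sigmoid(opacity),      # in (0, 1)
        "color": torch.sigmoid(color),          # RGB in (0, 1)
    }
```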

6. Limitations and Research Directions

  • Scalability and generalization: Canonical DRN and WDRN forms target fixed source-target relighting. Generalization to arbitrary “any-to-any” lighting or across scene types would require explicit encoding of the target illuminant or inclusion of reference guidance. WDRN, for example, does not inpaint newly exposed regions following dramatic light shifts, and strong shadows may be insufficiently recast absent geometric priors (Puthussery et al., 2020).
  • Data acquisition cost: Transformer-based RelightNet for human performance depends on dense annotation from lightstage setups (>1000 HDR mappings per subject), posing bottlenecks for broader deployment (Singh et al., 29 Nov 2025).
  • Physical plausibility: Absence of explicit normal, reflectance, or geometry modeling can lead to subtle artifacts—e.g., ghosting, incorrect highlight attribution—especially under extreme or synthetic lighting.

Ongoing work aims to integrate explicit shape or normal estimation, leverage additional illumination cues, and extend RelightNet architectures to “any-to-any” relighting with sparse reference signals or lighting condition encoding.

7. Comparative Summary and Impact

RelightNet architectures represent a substantial advance over both direct pixel-domain translation and analytic inverse lighting approaches by tightly integrating domain-inspired network structure, modular loss functions, and—where feasible—cross-modal data representation (e.g., mesh UV maps for humans). These methods are effective across scenarios ranging from single-image relighting (Wang et al., 2020) and 2D image translation via wavelets (Puthussery et al., 2020) to real-time, fully dynamic, and relit human rendering in photorealistic telepresence and free-viewpoint applications (Singh et al., 29 Nov 2025). The generalized “RelightNet” designation now references a set of neural relighting approaches that foreground efficiency, generalization, and photometric plausibility under diverse source and target conditions.
