Single-Image Portrait Relighting

Updated 12 March 2026

Single-image portrait relighting is the process of synthesizing realistic lighting on a human portrait from a single unconstrained input image by disentangling intrinsic facial properties from external illumination.
It draws on deep neural networks, including U-Net encoder–decoders, intrinsic-decomposition pipelines, and 3D-aware models, to produce photorealistic relit outputs with control over shadows and highlights.
Applications span photographic editing, AR/VR content, and visual effects; results are evaluated with perceptual metrics such as SSIM and LPIPS alongside pixel-wise error measures such as RMSE.

Single-image portrait relighting is the task of synthesizing the appearance of a human portrait under novel scene illumination, using only a single input image. This process enables realistic manipulation of perceived lighting conditions and is foundational for applications including photographic editing, AR/VR content, visual effects, and computational photography. Unlike classic approaches requiring controlled light stages, detailed 3D reconstructions, or multiple captures under varying conditions, single-image relighting seeks to extract, transform, and recombine lighting information directly from unconstrained consumer photos. The technical challenge centers on disentangling intrinsic facial properties (albedo, geometry) from illumination, then re-synthesizing the image under a user-specified lighting environment, typically encoded as a high-dynamic-range (HDR) environment map.

1. Problem Definition and Physical Formulation

Formally, the input is an RGB portrait $I \in \mathbb{R}^{H \times W \times 3}$ acquired under unknown or arbitrary conditions, optionally with a foreground mask $M$. The user supplies a target lighting environment $L_\mathrm{target} \in \mathbb{R}^{H_L \times W_L \times 3}$, typically parameterized as a latitude–longitude HDR panorama. The relighting task is to produce an image $R$ of the same subject as if illuminated by $L_\mathrm{target}$, while optionally estimating the original environment $L_\mathrm{source}$. The canonical image formation model is

$R(p) = \int_{\omega \in \Omega} L(\omega)\,\rho(p, \omega)\,\max(n(p)\cdot \omega, 0)\,d\omega$

where $L(\omega)$ is the incident radiance from direction $\omega$ (for relighting, $L = L_\mathrm{target}$), $\rho$ is the spatially-varying BRDF, and $n(p)$ is the surface normal at pixel $p$. Rather than inverting this model explicitly, recent methods learn a feedforward mapping $(\hat{L}_\mathrm{source},R) = f(I, L_\mathrm{target})$ via deep neural networks trained on paired or synthetic relighting data (Sun et al., 2019).
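
As a rough illustration of the image formation model, the integral can be approximated by a Riemann sum over the pixels of a latitude–longitude environment map. The sketch below is a simplification under stated assumptions: a purely Lambertian BRDF ($\rho = \text{albedo}/\pi$), a y-up direction convention, and no shadowing or interreflection. The function names `latlong_directions` and `render_lambertian` are hypothetical and do not come from the cited works.

```python
import numpy as np

def latlong_directions(h, w):
    """Unit directions and solid angles at the pixel centers of an h x w lat-long map.

    Assumed convention (hypothetical): polar angle measured from +y, azimuth around y.
    """
    theta = (np.arange(h) + 0.5) / h * np.pi        # polar angle in (0, pi)
    phi = (np.arange(w) + 0.5) / w * 2.0 * np.pi    # azimuth in (0, 2*pi)
    theta, phi = np.meshgrid(theta, phi, indexing="ij")
    dirs = np.stack([np.sin(theta) * np.cos(phi),
                     np.cos(theta),
                     np.sin(theta) * np.sin(phi)], axis=-1)
    # Solid angle of each pixel: sin(theta) * d_theta * d_phi
    d_omega = np.sin(theta) * (np.pi / h) * (2.0 * np.pi / w)
    return dirs.reshape(-1, 3), d_omega.reshape(-1)

def render_lambertian(albedo, normals, env):
    """Riemann-sum approximation of R(p), assuming rho = albedo / pi.

    albedo, normals: (H, W, 3); env: (H_L, W_L, 3) linear HDR radiance.
    """
    dirs, d_omega = latlong_directions(*env.shape[:2])
    radiance = env.reshape(-1, 3) * d_omega[:, None]               # L(w) * d_omega
    cosine = np.clip(normals.reshape(-1, 3) @ dirs.T, 0.0, None)   # max(n . w, 0)
    irradiance = cosine @ radiance                                 # (pixels, 3)
    return (albedo.reshape(-1, 3) / np.pi * irradiance).reshape(albedo.shape)
```

Under a constant unit-radiance environment the cosine-weighted integral equals $\pi$, so the rendered value should be close to the albedo itself, which makes a convenient sanity check. The dense pixel-by-direction matrix is only practical for small images and low-resolution environment maps.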

Practically, relighting methods must handle ambiguities in geometry, reflectance, and lighting, provide spatial control over shadows/specularities, and generalize to "in-the-wild" input with diverse appearances, occlusions, and background content. Success is measured by perceptual realism, geometric consistency, and quantitative similarity to ground-truth relit images.

2. Datasets and Synthetic Data Generation

Access to comprehensive, physically consistent training data is a principal bottleneck for portrait relighting research. High-fidelity datasets typically use light stage apparatus:

Arrays of hundreds of LED sources (OLAT: One-Light-At-a-Time) sample directional lighting; multi-view camera rigs provide multi-angle supervision; per-frame tracking corrects for subtle subject motions (Sun et al., 2019).
Post-processing workflows composite HDR environment lighting by linearly combining OLAT images weighted by projected solid angles or via spherical harmonics.
Known public datasets include the HDR environment-map collections Laval Indoor/Outdoor and PolyHaven (Mei et al., 2024) and, more recently, large-scale multi-expression OLAT sets such as POLAR (220 subjects, 156 lights, 28M images) (Chen et al., 15 Dec 2025) and FaceOLAT (139 subjects, 331 lights, 4K resolution) (Rao et al., 17 Oct 2025).
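
Because light transport is linear, compositing an environment from OLAT captures reduces to a weighted sum. The following is a minimal sketch, assuming the environment map has already been sampled at each light direction and that per-light solid angles are known; the function name `relight_from_olat` and its argument layout are illustrative, not taken from any cited pipeline.

```python
import numpy as np

def relight_from_olat(olat_images, light_rgb, solid_angles):
    """Combine OLAT images into an image-based relit result.

    olat_images:  (K, H, W, 3) linear images, one per light.
    light_rgb:    (K, 3) environment radiance sampled at each light direction (assumed given).
    solid_angles: (K,) solid angle covered by each light on the sphere (assumed given).
    """
    weights = light_rgb * solid_angles[:, None]            # per-light, per-channel weight
    # Sum over lights k, scaling each color channel c independently.
    return np.einsum("khwc,kc->hwc", olat_images, weights)
```

The sum is only physically meaningful on linear (not gamma-encoded) pixel values.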

To bypass hardware constraints, several works synthesize paired data:

Virtual light stage rendering superimposes detailed 3D faces, hair, clothing, and accessories, then renders under randomized panoramic maps using physically based rendering engines (e.g., Arnold, Blender Cycles) (Yeh et al., 2022, Chaturvedi et al., 16 Jan 2025).
Domain gap bridging employs synthetic-to-real adaptation with real portraits pooled for residual learning, GAN-based refinement, or multi-task objectives (Yeh et al., 2022, Chaturvedi et al., 16 Jan 2025).

Synthetic data allows explicit control over all ground-truth factors (albedo, normals, HDR lighting, shadow masks), critical for disentanglement and evaluative benchmarking.

3. Network Architectures and Representational Strategies

Portrait relighting architectures reflect two foundational paradigms: direct image-to-image translation and physically motivated inverse rendering.

Encoder–Decoder and U-Net Variants

Many early and baseline methods adopt U-Net-style encoder–decoders, employing skip connections and foreground masks to preserve facial details and suppress backgrounds (Sun et al., 2019).
Auxiliary prediction heads may regress lighting parameters (e.g., environment maps or SH coefficients) with spatial confidences, facilitating both relit outputs and lighting estimation (Sun et al., 2019).

Intrinsic Decomposition and Inverse Rendering

Physically inspired pipelines decompose input into albedo, normal, and lighting using supervised or self-supervised learning; relighting is explicit via parametric reflectance models such as Lambertian plus spherical harmonics (Zehni et al., 2021).
Self-supervised approaches enforce invariances among multi-illumination pairs or under geometric flipping/rotation augmentations, constraining the lighting code to SH parameter space (degree ≤ 2) (Liu et al., 2020).
Feature disentanglement is enhanced by cross-relighting losses and domain adaptation, enabling robust transfer to real images and diverse lighting (Zehni et al., 2021, Yeh et al., 2022).
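
To make the "Lambertian plus spherical harmonics" model concrete, the sketch below shades an albedo and normal map with nine second-order SH lighting coefficients per color channel, using the standard real SH basis and Lambertian convolution weights ($\pi$, $2\pi/3$, $\pi/4$ for bands 0, 1, 2). The coefficient ordering and the names `sh_basis` and `relight_sh` are assumptions for illustration; individual papers may order or scale the coefficients differently.

```python
import numpy as np

# Lambertian convolution weights A_l, repeated for the 1, 3, and 5 coefficients of bands 0, 1, 2.
_A = np.array([np.pi] + [2.0 * np.pi / 3.0] * 3 + [np.pi / 4.0] * 5)

def sh_basis(normals):
    """Real SH basis up to degree 2, evaluated at unit normals of shape (..., 3)."""
    x, y, z = normals[..., 0], normals[..., 1], normals[..., 2]
    return np.stack([
        0.282095 * np.ones_like(x),          # Y_0,0
        0.488603 * y,                        # Y_1,-1
        0.488603 * z,                        # Y_1,0
        0.488603 * x,                        # Y_1,1
        1.092548 * x * y,                    # Y_2,-2
        1.092548 * y * z,                    # Y_2,-1
        0.315392 * (3.0 * z ** 2 - 1.0),     # Y_2,0
        1.092548 * x * z,                    # Y_2,1
        0.546274 * (x ** 2 - y ** 2),        # Y_2,2
    ], axis=-1)

def relight_sh(albedo, normals, sh_coeffs):
    """Shade with SH lighting: albedo / pi * irradiance.

    albedo, normals: (H, W, 3); sh_coeffs: (9, 3), one column per color channel.
    """
    irradiance = (sh_basis(normals) * _A) @ sh_coeffs   # (H, W, 3)
    return albedo / np.pi * irradiance
```

Relighting then amounts to swapping `sh_coeffs` for the target lighting while keeping the estimated albedo and normals fixed. Cast shadows and specular highlights fall outside this model.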

3D-Aware and Volumetric Models

EG3D-style 3D GANs and tri-plane representations embed single portrait images into geometry-aware latent spaces, supporting full volumetric rendering, viewpoint change, and relighting (Mei et al., 2024, Rao et al., 2024, Rao et al., 17 Oct 2025).
Dedicated relighting modules process encoded tri-plane features, target HDR environment maps, and (optionally) explicit head-pose inputs (Mei et al., 2024).
OLAT-basis–driven models predict per-light responses in a flow-based or triplane-augmented latent space, supporting physically accurate environmental mixing and interpretable lighting control (Chen et al., 15 Dec 2025, Rao et al., 17 Oct 2025).

Diffusion and Generative Frameworks

Diffusion models (e.g., Stable Diffusion–based) learn the relighting transformation in the latent space, conditioning on input images, environment maps, and text prompts for semantic control (Chaturvedi et al., 16 Jan 2025, Liu et al., 17 Jun 2025, Cha et al., 2024).
Some approaches leverage multi-task objectives to fuse real (unpaired) and synthetic (paired) data, enabling broader generalization and robust appearance preservation (Chaturvedi et al., 16 Jan 2025).

4. Training Procedures, Objectives, and Losses

Training methodologies blend supervised, self-supervised, and adversarial strategies:

Core photometric losses (L1, L2) on predicted vs. ground-truth relit pixels, often weighted within segmented foregrounds (Sun et al., 2019).
Lighting consistency: SH or full env-map prediction losses; flow matching in latent OLAT-generation models (Chen et al., 15 Dec 2025).
Self-supervision: image reconstruction under original or jittered source lighting (to disentangle albedo and illumination), cross-input consistency on multi-lit pairs (Sun et al., 2019, Zehni et al., 2021).
Perceptual (VGG/LPIPS) losses for fine-scale qualitative improvement, especially in generative setups (Mei et al., 2024).
Adversarial losses (PatchGAN) may be used but are not strictly necessary for sharpness when physics-based constraints dominate (Chen et al., 15 Dec 2025).
Explicit geometric and identity preservation: losses in face-embedding space (ArcFace/MagFace), as well as temporal regularization for video (Rao et al., 17 Oct 2025, Yeh et al., 2022).
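
The foreground-weighted photometric term listed first can be sketched as follows. This is one plausible formulation, assuming a binary or soft mask with values in [0, 1] and normalization by the masked pixel count; the name `masked_l1` is hypothetical, and the cited methods may weight or normalize differently.

```python
import numpy as np

def masked_l1(pred, target, mask, eps=1e-8):
    """Mean absolute error restricted to the foreground.

    pred, target: (H, W, 3); mask: (H, W) with 1 for foreground, 0 for background.
    """
    m = mask[..., None].astype(pred.dtype)          # broadcast mask over channels
    total = (np.abs(pred - target) * m).sum()
    count = m.sum() * pred.shape[-1] + eps          # masked pixels times channels
    return float(total / count)
```

Background errors contribute nothing to this loss, so a network trained with it is free to ignore regions outside the mask.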

Typical pipelines combine diverse data augmentations: random cropping, environment map rotations, synthetic shadow/specularity augmentation, and photometric normalization.
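
Of these augmentations, environment map rotation is especially cheap for latitude–longitude panoramas, since a rotation about the vertical axis is a horizontal circular shift. The sketch below assumes that layout and rounds the shift to whole pixels; the function name `rotate_envmap` is illustrative, and whether a positive angle corresponds to a clockwise or counter-clockwise rotation depends on the panorama convention in use.

```python
import numpy as np

def rotate_envmap(env, degrees):
    """Rotate a lat-long environment map about the vertical axis.

    env: (H_L, W_L, 3). The shift is rounded to the nearest whole pixel column.
    """
    width = env.shape[1]
    shift = int(round(degrees / 360.0 * width))
    return np.roll(env, shift, axis=1)               # circular shift along azimuth
```

For paired training data, the target image must be rendered (or recombined from OLAT captures) under the rotated map so that the lighting and the image stay consistent.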

5. Relighting Inference, Applications, and User Control

Inference protocols convert novel single images into relit portraits as follows:

Input images are preprocessed via cropping, matting, or face segmentation; optional normalization or color correction adjusts dynamic range (Sun et al., 2019).
The network infers scene attributes (albedo, normals, lighting parameters), performs environment map encoding, and generates the relit output conditioned on the target (Zehni et al., 2021, Mei et al., 2024).
OLAT-driven or tri-plane volumetric models enable explicit environment mixing, 3D consistent relighting, and novel viewpoint rendering (Chen et al., 15 Dec 2025, Rao et al., 17 Oct 2025, Rao et al., 2024).
Diffusion models provide additional controls via classifier-free guidance, trading off identity preservation and relighting strength (Chaturvedi et al., 16 Jan 2025). Text-driven relighting is supported in recent generative architectures (Liu et al., 17 Jun 2025, Cha et al., 2024), and freehand-scribble or parameter-sweep interfaces offer intuitive user editing (Mei et al., 2023, Futschik et al., 2023).
Applications extend to complete lighting swaps across portraits, light transfer between images, scene harmonization for compositing, video relighting, and semantic/structural editing.
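
The classifier-free guidance control mentioned above is commonly implemented as a linear extrapolation between an unconditional and a conditional noise prediction. The sketch shows only that standard combination; which conditioning signal (environment map, text, or input image) is dropped for the unconditional branch, and which scale values work well, are not specified here and vary between methods. The function name is illustrative.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Blend two noise predictions of equal shape.

    guidance_scale = 0 returns the unconditional prediction, 1 the conditional one,
    and values above 1 push further in the direction of the condition.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

In a relighting setting, a larger scale would be expected to strengthen the lighting change at some cost to fidelity with the input, which matches the trade-off described in the list above.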

6. Quantitative Metrics, Limitations, and Comparative Results

Standard evaluation metrics include RMSE, scale-invariant RMSE, SSIM, DSSIM, PSNR, LPIPS, and face-ID similarity (cosine in embedding space) (Sun et al., 2019, Mei et al., 2024, Cha et al., 2024, Chen et al., 15 Dec 2025). State-of-the-art models on held-out test sets report (example values):

RMSE ≈ 0.04–0.18 (scale-dependent), DSSIM ≈ 0.025, LPIPS down to 0.09–0.12, SSIM up to 0.85 (Sun et al., 2019, Mei et al., 2024, Chen et al., 15 Dec 2025).
Inference speeds: 160 ms for a $640 \times 640$ image on Titan Xp (Sun et al., 2019); OLAT generation ≈0.35 s per direction on A100 (Chen et al., 15 Dec 2025).
Comparative studies indicate 50–75% lower error and order-of-magnitude speedups over prior shape-from-shading, deep portrait relighting, and NeRF-based baselines (Sun et al., 2019, Rao et al., 17 Oct 2025, Rao et al., 2024).
Participants in user studies consistently prefer models that offer interaction, high-frequency detail transfer, and identity preservation (Mei et al., 2023, Chaturvedi et al., 16 Jan 2025, Rao et al., 17 Oct 2025).
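
Several of the pixel-wise metrics above are simple to compute directly. The sketch below uses common definitions, which are assumptions rather than the exact protocols of the cited papers: scale-invariant RMSE fits a single global least-squares scale before measuring error (some works fit one scale per channel), and DSSIM is taken as $(1 - \mathrm{SSIM})/2$. SSIM and LPIPS themselves are left to dedicated libraries.

```python
import numpy as np

def rmse(pred, gt):
    """Root-mean-square error over all pixels and channels."""
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def si_rmse(pred, gt, eps=1e-12):
    """RMSE after the single global scale that best aligns pred to gt (least squares)."""
    scale = float(np.sum(pred * gt) / (np.sum(pred * pred) + eps))
    return rmse(scale * pred, gt)

def psnr(pred, gt, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with maximum value `peak`."""
    mse = np.mean((pred - gt) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))

def dssim(ssim_value):
    """Structural dissimilarity from a precomputed SSIM score (common definition)."""
    return (1.0 - ssim_value) / 2.0
```

The scale-invariant variant is useful because a single image cannot determine absolute exposure: a prediction that differs from the ground truth only by global brightness scores zero under `si_rmse` but not under `rmse`.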

Noted limitations include:

Failure on out-of-domain inputs (poses, accessories), artifacts in extreme specularities or harsh shadows (if underrepresented in training), and tradeoffs between structural fidelity and strong illumination effects (Sun et al., 2019, Rao et al., 17 Oct 2025, Chaturvedi et al., 16 Jan 2025).
Generalization is limited by dataset diversity in skin tone, age, hair style, and occlusion (Chaturvedi et al., 16 Jan 2025).
Simple SH or low-order basis lighting often restricts representation of non-Lambertian, high-frequency effects; 3D tri-plane architectures and OLAT-based approaches provide significant improvements (Chen et al., 15 Dec 2025, Rao et al., 17 Oct 2025, Mei et al., 2024).

7. Emerging Directions and Technical Extensions

Recent innovations in single-image portrait relighting include:

Large-scale, physically calibrated OLAT datasets (POLAR, FaceOLAT) for flow-based, high-fidelity per-light modeling (Chen et al., 15 Dec 2025, Rao et al., 17 Oct 2025).
3D GAN-disentangled pipelines with explicit volumetric rendering, multi-view GAN inversion, and pose/light decoupling for pose-dependent shadows and highlights (Mei et al., 2024, Rao et al., 2024, Rao et al., 17 Oct 2025).
Integration of diffusion models, both text- and image-driven, with conditioning modules (e.g., PGLA, spectral fixers, scribble-guided relighting) to enable interactive, semantically controlled lighting modulation (Liu et al., 17 Jun 2025, Mei et al., 2023).
Synthetic-to-real adaptation via small residual modules together with contrastive and cross-lighting losses to bridge the realism gap (Yeh et al., 2022, Chaturvedi et al., 16 Jan 2025).
Layered, frequency-adaptive recombination and attention mechanisms to preserve crisp facial features and achieve seamless background–foreground harmonization (Liu et al., 17 Jun 2025).
Extensions to video (temporal stabilization), expressive lighting (emotion, style, or time-of-day), and generalized subject types (cartoons/figurines/full-body) (Rao et al., 17 Oct 2025, Chaturvedi et al., 16 Jan 2025, Cha et al., 2024).

The field continues to advance toward physically accurate, identity-preserving, highly controllable portrait relighting from a single unconstrained input, leveraging synergistic progress in generative modeling, large-scale data collection, and computationally efficient rendering.