LiMo: Lighting in Motion
- Lighting in Motion is a framework that models spatiotemporally varying illumination in dynamic scenes using ML-based techniques.
- It employs diffusion models, transformer networks, and neural radiance fields to disentangle lighting from shape, pose, and albedo factors.
- LiMo enables realistic video synthesis, portrait relighting, and AR/VR applications while addressing challenges like real-time inference and scene generalization.
Lighting in Motion (LiMo) refers to a class of computational and ML-based frameworks designed to estimate, control, or synthesize temporally and spatially resolved lighting in dynamic scenes, with applications ranging from high-fidelity video synthesis and relighting to visual perception for autonomous systems. LiMo encompasses both the disentanglement and manipulation of lighting from other scene factors (shape, pose, albedo) and the accurate estimation of HDR illumination fields at arbitrary 3D points as objects and viewpoints vary over time.
1. Foundations: Definition and Scope
Lighting in Motion denotes approaches wherein lighting is explicitly represented, estimated, or controlled as a spatiotemporally varying field, distinct from static or globally parameterized illumination. LiMo targets tasks such as:
- Spatiotemporal HDR lighting estimation at arbitrary 3D locations in a scene (Bolduc et al., 15 Dec 2025).
- Explicit disentanglement and controllable synthesis of lighting (intensity, direction, chromaticity, and trajectory) in video generation (Zhang et al., 30 Oct 2024, Zheng et al., 11 Feb 2025).
- Enabling portrait relighting, talking-head synthesis, and 3D-aware avatar reenactment with temporally coherent and editable illumination (Zhao et al., 26 May 2025, Sun et al., 26 Dec 2024).
Unlike earlier relighting pipelines, in which lighting is a fixed or purely global latent, LiMo frameworks introduce mechanisms for fine-grained, per-frame, per-region lighting modulation, typically in conjunction with explicit 3D conditioning or reference paths.
2. Methodological Approaches
LiMo methods are defined by their architectural innovations and conditioning strategies that enable temporally adaptive lighting modeling.
a. Diffusion-based Spatiotemporal HDR Lighting Estimation
The LiMo framework (Bolduc et al., 15 Dec 2025) uses a diffusion model fine-tuned to inpaint the appearance of mirror and diffuse spheres, synthetically placed at arbitrary scene points and rendered at multiple EV (exposure value) brackets. Given conditioning maps (depth, normals, geometric directions) and a text prompt specifying sphere type and exposure, the model predicts the correct sphere appearance for the local lighting field.
- Conditioning maps include: depth, normal at the sphere, direction vectors from background pixels to the query point, and distances.
- The generated stack of bracketed spheres (mirror and diffuse, various EVs) is fused into a single HDR environment map via differentiable rendering, solving for the environment that, when rendered on parametric spheres, matches the predicted images (a minimal fitting sketch follows this list).
- For temporal tasks, a video diffusion model is conditioned on sequences of these maps, with loss terms enforcing both spatial (L2) and temporal (L1) consistency.
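The fusion step can be illustrated with a minimal sketch: assuming an orthographically viewed mirror ball, a latitude-longitude environment parameterization, and simple clamping as the LDR camera response, a small HDR environment map is fit by gradient descent so that its re-rendered, re-exposed spheres match the predicted brackets. The function names (`fit_environment_map`, `mirror_ball_reflections`), coordinate conventions, and optimization settings are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def mirror_ball_reflections(res):
    """Reflection direction per pixel of an orthographically viewed mirror ball."""
    ys, xs = torch.meshgrid(
        torch.linspace(1, -1, res), torch.linspace(-1, 1, res), indexing="ij"
    )
    mask = (xs**2 + ys**2 <= 1.0).float()
    nz = torch.sqrt(torch.clamp(1.0 - xs**2 - ys**2, min=0.0))
    n = torch.stack([xs, ys, nz], dim=-1)                   # surface normal
    v = torch.tensor([0.0, 0.0, 1.0])                       # direction toward camera
    r = 2.0 * (n * v).sum(-1, keepdim=True) * n - v         # reflected view ray
    return r, mask

def latlong_grid(r):
    """Map unit directions to lat-long texture coordinates in [-1, 1] for grid_sample."""
    u = torch.atan2(r[..., 0], -r[..., 2]) / torch.pi
    v = 2.0 * torch.acos(r[..., 1].clamp(-1, 1)) / torch.pi - 1.0
    return torch.stack([u, v], dim=-1)

def fit_environment_map(pred_spheres, evs, env_hw=(64, 128), steps=500):
    """pred_spheres: (B, 3, S, S) predicted LDR mirror-ball crops, one per EV bracket in evs: (B,)."""
    B, _, S, _ = pred_spheres.shape
    r, mask = mirror_ball_reflections(S)
    grid = latlong_grid(r)[None].expand(B, -1, -1, -1)      # (B, S, S, 2)
    log_env = torch.zeros(1, 3, *env_hw, requires_grad=True)
    opt = torch.optim.Adam([log_env], lr=5e-2)
    for _ in range(steps):
        env = log_env.exp().expand(B, -1, -1, -1)
        rendered = F.grid_sample(env, grid, mode="bilinear", align_corners=True)
        exposed = (rendered * (2.0 ** evs.view(B, 1, 1, 1))).clamp(0, 1)  # crude LDR response
        loss = ((exposed - pred_spheres) ** 2 * mask).mean()              # masked L2 on the ball
        opt.zero_grad()
        loss.backward()
        opt.step()
    return log_env.detach().exp()
```

In the full method, the diffuse-sphere predictions and a learned camera response would enter the same objective; the sketch keeps only the mirror-ball term.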
b. Diffusion and Transformer-Based Video Generation with LiMo Control
Several generative video frameworks integrate LiMo via cross-attention or plug-in modules:
- Spatial Triple-Attention: VidCRAFT3 (Zheng et al., 11 Feb 2025) incorporates a lighting-direction embedding (spherical-harmonic encoded and projected via an MLP) as a third cross-attention branch, parallel to the image and text branches, so each transformer block can simultaneously attend to the per-frame lighting direction. The lighting direction is a 3D unit vector held constant within a video but varied across the dataset, enabling precise directional control (see the cross-attention sketch after this list).
- Plug-and-Play Latent Lighting Injection: LumiSculpt (Zhang et al., 30 Oct 2024) injects a per-frame lighting latent, encoded from a sequence of lighting reference images via a VAE and transformer, into each layer and frame of a DiT-based video diffusion backbone. A disentanglement branch (a frozen DiT) provides a contrastive loss that penalizes entanglement between lighting and scene identity (see the injection sketch after this list).
- Disentangled Motion and Light Control: UniAvatar (Sun et al., 26 Dec 2024) decouples 3D motion and global illumination control by rendering motion and illumination "guidance" images with FLAME + SH lighting, injecting each independently as explicit conditions into spatial attention and U-Net diffusion blocks.
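As a minimal sketch of the VidCRAFT3-style lighting branch described above, the block below encodes a lighting direction with a degree-2 real spherical-harmonic basis, projects it through an MLP into a conditioning token, and attends to it in parallel with text and image tokens. The module layout, dimensions, and the additive combination of the branches are assumptions for illustration, not the published architecture.

```python
import torch
import torch.nn as nn

def sh_basis_deg2(d):
    """Real spherical-harmonic basis (degree <= 2, 9 terms) for unit directions d: (..., 3)."""
    x, y, z = d[..., 0], d[..., 1], d[..., 2]
    return torch.stack([
        0.282095 * torch.ones_like(x),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3 * z**2 - 1),
        1.092548 * x * z, 0.546274 * (x**2 - y**2),
    ], dim=-1)

class TripleCrossAttentionBlock(nn.Module):
    """Self-attention plus parallel cross-attention to text, image, and lighting tokens."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.light_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.light_mlp = nn.Sequential(nn.Linear(9, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, text_tokens, image_tokens, light_dir):
        # Encode the lighting direction (B, 3) as a single conditioning token.
        light_tok = self.light_mlp(sh_basis_deg2(light_dir)).unsqueeze(1)  # (B, 1, dim)
        h = video_tokens
        h = h + self.self_attn(h, h, h)[0]
        # The three cross-attention branches are summed, mirroring a parallel design.
        h = h + self.text_attn(h, text_tokens, text_tokens)[0] \
              + self.img_attn(h, image_tokens, image_tokens)[0] \
              + self.light_attn(h, light_tok, light_tok)[0]
        return self.norm(h)
```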
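A corresponding sketch of LumiSculpt-style per-frame lighting-latent injection: lighting reference frames are encoded into per-frame latents (a small convolutional encoder stands in for the pretrained VAE), mixed temporally by a transformer, and added to each frame's tokens inside every backbone block. The additive injection, the layer-wise projections, and the omission of the contrastive disentanglement branch are simplifications under stated assumptions.

```python
import torch
import torch.nn as nn

class PerFrameLightingInjector(nn.Module):
    """Encodes lighting reference frames into per-frame latents and adds a projected
    copy of each frame's latent to that frame's tokens in every backbone block."""
    def __init__(self, latent_dim=4, token_dim=512, n_blocks=12):
        super().__init__()
        # Stand-in for a pretrained VAE encoder: downsample frames to a small latent map.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=4), nn.SiLU(),
            nn.Conv2d(32, latent_dim, 4, stride=4),
        )
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=token_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.to_token = nn.Linear(latent_dim, token_dim)
        # One projection per backbone block, so injection can differ per layer.
        self.per_block = nn.ModuleList(nn.Linear(token_dim, token_dim) for _ in range(n_blocks))

    def lighting_tokens(self, ref_frames):
        """ref_frames: (B, T, 3, H, W) lighting reference clip -> (B, T, token_dim)."""
        B, T = ref_frames.shape[:2]
        z = self.encoder(ref_frames.flatten(0, 1)).mean(dim=(-2, -1))   # (B*T, latent_dim)
        tokens = self.to_token(z).view(B, T, -1)
        return self.temporal(tokens)                                    # temporally mixed

    def inject(self, hidden, light_tokens, block_idx):
        """hidden: (B, T, N, D) per-frame spatial tokens inside block `block_idx`."""
        cond = self.per_block[block_idx](light_tokens)                  # (B, T, D)
        return hidden + cond.unsqueeze(2)                               # broadcast over N
```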
c. Volumetric and Radiance-Field Approaches
- Total-Editing (Zhao et al., 26 May 2025) addresses LiMo by learning an intrinsic decomposition of view-dependent color into reflectance/albedo and physically motivated (Phong) shading terms within a neural radiance field (NeRF) decoder. Lighting is encoded either as environment maps filtered into lightmaps (spherical-harmonic or Phong-lobe), or as feature representations extracted from a relit portrait via U-Net and transformer modules. Deformation fields ensure that shading moves coherently with facial geometry.
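The decomposition can be written compactly as radiance ≈ albedo ⊙ shading. The sketch below evaluates an illustrative Phong-style shading term for a single directional light; the actual method predicts these quantities inside a NeRF decoder and derives lighting from lightmaps or portrait features, so the explicit light direction, coefficients, and shininess here are assumptions.

```python
import torch

def phong_shaded_color(albedo, normals, view_dirs, light_dir, light_rgb,
                       ambient=0.1, k_d=1.0, k_s=0.3, shininess=32.0):
    """Per-point color as albedo times a Phong-style shading term.

    albedo, normals, view_dirs: (..., 3); light_dir: (3,) unit vector toward the light;
    light_rgb: (3,) light color/intensity.
    """
    n = torch.nn.functional.normalize(normals, dim=-1)
    v = torch.nn.functional.normalize(view_dirs, dim=-1)
    l = torch.nn.functional.normalize(light_dir, dim=-1)
    # Diffuse term: Lambert cosine lobe.
    ndotl = (n * l).sum(-1, keepdim=True).clamp(min=0.0)
    # Specular term: reflection of the light about the normal, dotted with the view ray.
    r = 2.0 * (n * l).sum(-1, keepdim=True) * n - l
    rdotv = (r * v).sum(-1, keepdim=True).clamp(min=0.0)
    shading = ambient + k_d * ndotl * light_rgb + k_s * (rdotv ** shininess) * light_rgb
    return albedo * shading
```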
3. Geometric and Physical Conditioning
Across modern LiMo frameworks, depth maps alone are insufficient for recovering accurate spatiotemporal lighting, especially under occlusion, shadowing, or near-field conditions.
- LiMo (Bolduc et al., 15 Dec 2025) introduces geometric maps: normals computed analytically at the query point, direction vectors from all pixels to the probe center, and a distance channel; these are encoded alongside the standard model latents (a minimal sketch of these maps appears after this list).
- In generative models, spherical harmonics are the standard low-frequency lighting parameterization (9 or 16 SH coefficients) (Sun et al., 26 Dec 2024, Zheng et al., 11 Feb 2025). For more accurate view- and location-specific illumination, environment maps or "lightmaps" pre-integrate incident light over sampled normals or view reflections (Zhao et al., 26 May 2025).
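A minimal sketch of the geometric conditioning maps, assuming a pinhole camera with known intrinsics and metric depth: pixels are back-projected to 3D, and per-pixel unit directions and distances to the probe centre are returned (the analytic sphere-normal channel is omitted). Function and argument names are illustrative, not the paper's interface.

```python
import torch

def geometric_condition_maps(depth, K, query_point):
    """Build direction and distance conditioning maps for a 3D query point.

    depth: (H, W) metric depth; K: (3, 3) pinhole intrinsics;
    query_point: (3,) probe centre in camera coordinates.
    Returns directions (H, W, 3) from each back-projected pixel to the probe,
    and distances (H, W).
    """
    H, W = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype), torch.arange(W, dtype=depth.dtype),
        indexing="ij",
    )
    # Back-project pixels to camera-space 3D points.
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)            # (H, W, 3)
    rays = pix @ torch.linalg.inv(K).T
    points = rays * depth.unsqueeze(-1)
    # Per-pixel vector to the probe centre, split into unit direction and distance.
    offsets = query_point - points
    distances = offsets.norm(dim=-1)
    directions = offsets / distances.clamp(min=1e-6).unsqueeze(-1)
    return directions, distances
```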
4. Training Objectives and Datasets
LiMo models for estimation or synthesis share common loss and data strategies:
- Noise-prediction objective in diffusion models, standard within DDPM/DDIM frameworks (Bolduc et al., 15 Dec 2025, Zheng et al., 11 Feb 2025, Zhang et al., 30 Oct 2024).
- Spatial (L2) and temporal (L1) rendering losses between predicted and ground-truth probe (sphere) images, sometimes with masking to handle saturation or occlusion (a minimal loss sketch follows this list).
- Adversarial, LPIPS, and identity losses supplement perceptual fidelity in portrait/avatar use cases (Zhao et al., 26 May 2025).
- Specialized datasets: LiMo's estimation work uses a large-scale synthetic corpus of indoor/outdoor scenes with multi-exposure HDR spheres at randomized depths across camera, object, and light trajectories (Bolduc et al., 15 Dec 2025). LumiSculpt constructs LumiHuman, a dataset of MetaHuman characters under dense 3D grids of point-light trajectories with procedurally captioned lighting prompts (Zhang et al., 30 Oct 2024). VidCRAFT3's VideoLightingDirection dataset is generated in Blender, sampling HDR environment lighting together with geometric ground truth (Zheng et al., 11 Feb 2025).
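A minimal sketch of the spatial/temporal rendering objective described above, with a validity mask that excludes saturated or occluded pixels; the temporal weighting and normalization are assumptions rather than values reported by the papers.

```python
import torch

def probe_rendering_loss(pred, target, mask, w_temporal=0.5):
    """Masked spatial L2 plus temporal L1 consistency on predicted probe sequences.

    pred, target: (B, T, 3, H, W) rendered probe (sphere) images;
    mask: (B, T, 1, H, W) with 1 for valid (unsaturated, unoccluded) pixels.
    """
    spatial = ((pred - target) ** 2 * mask).sum() / mask.sum().clamp(min=1.0)
    # Penalise frame-to-frame changes in the prediction that the ground truth does not show.
    d_pred = pred[:, 1:] - pred[:, :-1]
    d_true = target[:, 1:] - target[:, :-1]
    m_pair = mask[:, 1:] * mask[:, :-1]
    temporal = ((d_pred - d_true).abs() * m_pair).sum() / m_pair.sum().clamp(min=1.0)
    return spatial + w_temporal * temporal
```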
5. Quantitative and Qualitative Evaluation
Recent works report improved performance over baselines when lighting is modeled explicitly as a spatiotemporal factor:
| Model | Domain | Task | PSNR (dB) ↑ | SSIM ↑ | LPIPS ↓ | FID ↓ | Lighting Err. ↓ |
|---|---|---|---|---|---|---|---|
| LiMo (Bolduc et al., 15 Dec 2025) | HDR estimation | HDR probe recovery | — | 0.95 | — | — | 4.4° (RGB angular) |
| Total-Editing (Zhao et al., 26 May 2025) | Portrait video | Face relighting/reenactment | 20.3 | 0.73 | 0.226 | 38.3 | — |
| UniAvatar (Sun et al., 26 Dec 2024) | Talking-head | Motion + lighting control | 24.5 | — | 0.175 | 74.4 | — |
| LumiSculpt (Zhang et al., 30 Oct 2024) | Generative video | Lighting consistency | — | — | 1.131 | — | 0.35 (RMSE) |
| VidCRAFT3 (Zheng et al., 11 Feb 2025) | Open-domain video | Relighting synthesis | 19.5 | 0.74 | 0.11 | — | — |
Qualitatively, LiMo frameworks recover fine HDR highlights (small lamps, windows) and temporally coherent shadow trajectories. Video generation models (LumiSculpt, VidCRAFT3, UniAvatar) achieve artifact-free, temporally smooth relighting and preserve motion-illumination independence under large pose or lighting-direction changes.
6. Applications and Limitations
LiMo methodologies underpin critical advances in:
- Spatiotemporal HDRI capture for realistic object insertion, AR/VR asset relighting, and mixed-reality composition (Bolduc et al., 15 Dec 2025).
- Portrait video and avatar reenactment with physically-plausible dynamic illumination, surpassing two-stage reenactment+relighting cascades in fidelity and identity preservation (Zhao et al., 26 May 2025, Sun et al., 26 Dec 2024).
- Open-domain, text-driven video synthesis with full control over lighting, object, and camera motion (Zhang et al., 30 Oct 2024, Zheng et al., 11 Feb 2025).
Limitations:
- Real-time inference is not yet feasible for most detailed LiMo pipelines, owing to large model sizes and optimization overhead (Zhao et al., 26 May 2025, Bolduc et al., 15 Dec 2025).
- Current approaches may lack explicit visibility/shadow computation for out-of-distribution objects (e.g., occlusion by accessories).
- Estimation methods rely heavily on synthetic datasets; generalization to real, uncontrolled scenes requires further research.
7. Perspectives and Future Directions
LiMo has established itself as a state-of-the-art paradigm for spatiotemporal lighting estimation and control, enabling precise illumination measurement as well as controllable generative synthesis. Ongoing research focuses on:
- Reducing computational demands via sparse voxel fields and efficient visibility modeling (Zhao et al., 26 May 2025).
- Expanding datasets to cover more diverse real-world lighting/material/geometry regimes, with procedural and crowdsourced annotation (Zheng et al., 11 Feb 2025, Zhang et al., 30 Oct 2024).
- Joint modeling of lighting and other physical factors such as weather, environment texture, sensor characteristics, and multi-modal cues (audio for video synthesis).
- Integration of LiMo-inspired conditioning in autonomous perception, such as ego-motion estimation in extreme low-light via active illumination frameworks (Crocetti et al., 19 Feb 2025).
Lighting in Motion delineates the current frontier in computational illumination, synthesizing advances in diffusion modeling, neural rendering, structured conditioning, and geometric-physical reasoning for temporally and spatially controllable light management across vision, graphics, and robotics.