
Spatiotemporal Lighting Estimation

Updated 22 December 2025
  • Spatiotemporal lighting estimation is the process of recovering detailed illumination distributions—including direction, spectrum, and intensity—across 3D space and time.
  • The approach combines physical modeling (environment maps, spherical harmonics, and parametric representations) with learning-based architectures for accurate per-point lighting recovery.
  • Applications include photorealistic rendering, augmented reality, and scientific analysis, though challenges remain in handling occlusions, dynamic scenes, and calibration accuracy.

Spatiotemporal lighting estimation refers to the quantitative recovery of lighting distributions, including their directionality, spectrum, intensity, and variation over both space and time, from visual or physical observations. The goal is to infer, at arbitrary 3D positions and time steps, environment maps or parametric representations that explain observed appearances or measured radiometric data with sufficient fidelity for rendering, relighting, perception modeling, or scientific analysis. The field encompasses modalities ranging from HDR image-based methods and neural scene representations to 7D light-field measurement and transient (time-of-flight) acquisition, alongside learning-based spatiotemporal aggregation and efficient parametric compression.

1. Problem Formulation and Key Mathematical Models

Spatiotemporal lighting estimation generalizes traditional illumination recovery by targeting $\mathbf{L}(\mathbf{x}, t): \mathbb{S}^2 \rightarrow \mathbb{R}^3$ at arbitrary spatial query points $\mathbf{x} \in \mathbb{R}^3$ and discrete times $t$, as opposed to global scene lighting or single static viewpoints. The objective is to reconstruct, at each $(\mathbf{x}, t)$, a function over directions $\omega \in \mathbb{S}^2$ (e.g., an HDR environment map) that, when used for forward rendering, matches reference observations under arbitrary reflectance.

Formally, given a sequence of visual inputs $\{I_t\}_{t=1}^T$ and query points $\mathbf{c}$, one seeks a set of environment maps $\{L_t\}$ such that, for any material, the rendered appearance of a canonical probe (e.g., virtual spheres with mirror or diffuse BRDF) placed at $\mathbf{c}$ under $L_t$ agrees with corresponding predictions or measurements. This is expressed as a minimization:
$$\hat{L} = \arg\min_{L} \sum_{t=1}^{T} \sum_{e \in \mathcal{E}} \sum_{m \in \{\text{mirror},\,\text{diffuse}\}} \ell\bigl(\pi(e, m, t) - e \cdot \mathcal{R}(L_t, m)\bigr)$$
where $\mathcal{R}$ is a differentiable renderer and $\ell$ includes both photometric and temporal penalties (Bolduc et al., 15 Dec 2025). Extensions of this model include the 7D plenoptic light field $L(\mathbf{x}, \theta, \phi, \lambda, t)$ (Yu et al., 2022) to account for wavelength and multi-point structure, as well as temporal profiles for transient imaging (Royo et al., 2023).
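The probe-fitting objective above can be sketched with a differentiable-programming framework. The snippet below is only a minimal illustration under simplifying assumptions, not the cited pipeline: the renderer $\mathcal{R}(\cdot, m)$ is approximated by a precomputed linear operator per material (a matrix mapping environment-map pixels to probe pixels), and the probe predictions $\pi(e, m, t)$ are assumed given; `render_ops`, `probes`, and `exposures` are illustrative names.

```python
# Minimal sketch of the probe-fitting minimization, assuming the renderer
# R(., m) is approximated by a precomputed linear operator per material.
import torch
import torch.nn.functional as F

def fit_env_maps(probes, render_ops, exposures, T, H, W, steps=500, lam_t=0.1):
    """probes[(t, m, e)]: probe image flattened to (P, 3) for time t,
       material m, exposure e; render_ops[m]: (P, H*W) tensor approximating
       R(., m); exposures: list of scalar exposure multipliers."""
    # Optimize log-radiance so the recovered HDR environment maps stay positive.
    log_L = torch.zeros(T, H * W, 3, requires_grad=True)
    opt = torch.optim.Adam([log_L], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        L = torch.exp(log_L)                              # (T, H*W, 3)
        loss = 0.0
        for t in range(T):
            for m, A in render_ops.items():               # per-material operator
                rendered = A @ L[t]                       # (P, 3) probe render
                for e in exposures:
                    loss = loss + F.l1_loss(e * rendered, probes[(t, m, e)])
        # Temporal penalty on differences between adjacent environment maps.
        loss = loss + lam_t * (L[1:] - L[:-1]).abs().mean()
        loss.backward()
        opt.step()
    return torch.exp(log_L).detach().reshape(T, H, W, 3)
```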

2. Parametric, Physical, and Data-driven Representations

Spatiotemporal lighting has been described using a spectrum of representations, each suited to different scales, modalities, and application domains.

  • Environment Maps: Discrete HDR spherical images at each $(\mathbf{x}, t)$, supporting full-frequency lighting and arbitrary reflectance (Bolduc et al., 15 Dec 2025, Wei et al., 2020).
  • Spherical Harmonics Expansion: Compact, low-pass models for mobile and AR use cases, e.g., $L(\omega) \approx \sum_{l,m} c_{l,m} Y_l^m(\omega)$, with SH coefficients estimated per location and time (Zhao et al., 2021); see the sketch after this list.
  • Spherical Gaussian Lighting Volume (SGLV): Each spatial voxel parameterized by intensity, directional, and sharpness terms, allowing continuous spatial queries and physically-motivated blending (Li et al., 2023).
  • First-Order SH (Spectral Cubic Illumination): For measurement-based contexts, six orthogonal-axis spectral probes decompose illumination into zeroth-order (diffuse) and first-order (directional/vector) terms, enabling recovery of angular, spectral, and temporal structure up to linear accuracy (Yu et al., 2022).
  • EMG Mixture Models (Transient Domains): For time-resolved light transport, illumination at a pixel is modeled as a mixture of exponentially-modified Gaussians, compactly encoding multi-bounce/timing structure over $t$ with a small parameter set (Royo et al., 2023).
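As a concrete instance of the spherical-harmonics representation above, the sketch below evaluates a second-order real SH lighting model (nine coefficients per color channel), the kind of compact encoding suited to mobile AR clients; the coefficient layout is a common convention assumed here for illustration, not necessarily the one used in the cited system.

```python
# Evaluate L(omega) ~= sum_{l,m} c_{l,m} Y_l^m(omega) for bands l <= 2.
import numpy as np

def real_sh_basis(d):
    """Real SH basis values (bands 0-2) for a unit direction d = (x, y, z)."""
    x, y, z = d
    return np.array([
        0.282095,                        # Y_0^0
        0.488603 * y,                    # Y_1^-1
        0.488603 * z,                    # Y_1^0
        0.488603 * x,                    # Y_1^1
        1.092548 * x * y,                # Y_2^-2
        1.092548 * y * z,                # Y_2^-1
        0.315392 * (3.0 * z * z - 1.0),  # Y_2^0
        1.092548 * x * z,                # Y_2^1
        0.546274 * (x * x - y * y),      # Y_2^2
    ])

def eval_sh_lighting(coeffs, direction):
    """coeffs: (9, 3) RGB SH coefficients estimated per location and time;
       returns the (3,) RGB radiance arriving from `direction`."""
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    return real_sh_basis(d) @ coeffs
```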

The choice of representation critically shapes the estimator’s spatial, temporal, spectral, and angular fidelity, resource requirements, and downstream rendering/application suitability.

3. Learning-based and Physics-informed Estimation Architectures

Modern spatiotemporal lighting estimation synthesizes data-driven and physically-informed approaches. Key pipelines can be abstracted as follows:

  • Diffusion-based Conditional Generation: Methods such as LiMo (Bolduc et al., 15 Dec 2025) employ pre-trained diffusion networks (Flux.1 Schnell, 12B parameters; Wan2.2, 5B) fine-tuned with explicit geometric conditioning (packaged as RGB, normal, depth, log-distance, and direction channels via a VAE encoder) for per-point, per-material, multi-exposure lighting probes; see the conditioning sketch after this list. At inference, predicted probes are fused via differentiable optimization over $L$.
  • Recurrent Volumetric Encoders: SGLV approaches (Li et al., 2023) employ 3D encoder-decoders with recurrent GRU refinement, integrating video cues and spatial consistency. A differentiable volumetric ray marcher and a hybrid 2D blend net enforce per-point angular accuracy and recovery of high-frequency details in visible regions.
  • Object-based Inverse Rendering: Decomposition pipelines split appearance into diffuse/specular terms (via U-Net) and analytically invert shading observations to partial environment maps, unifying physics-based rendering with learned map fusion and recurrent angular-domain convolution for temporally coherent outputs (Wei et al., 2020).
  • Transformer-based Aggregation: For outdoor or wide-FOV panoramic sequences (Lee et al., 2022), patch-wise visual tokens (ResNet-encoded) are temporally and spatially fused via vision transformers, employing carefully constructed cyclic positional encodings (position within FOV, relative yaw) to ensure global orientation alignment and robustness. All-patch outputs are calibrated via structure-from-motion for sun-direction estimation.
  • Hybrid Mobile 3D Vision Pipelines: Edge-assisted systems (e.g., Xihe (Zhao et al., 2021)) convert dense on-device point clouds to angularly-uniform anchor sets (unit-sphere-based) with compact per-anchor features, enabling fast, temporally-coherent SH coefficient estimation for AR.
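The geometric conditioning packaged for the diffusion-based pipeline above can be illustrated as follows. This is a hedged sketch under assumed conventions (directions and log-distances measured from the 3D query point to each back-projected scene point), not the exact encoding of the cited work.

```python
# Build a per-pixel conditioning stack: RGB, normals, depth, log-distance,
# and unit direction from a 3D query point to each back-projected scene point.
import numpy as np

def geometric_conditioning(rgb, depth, normals, K, query_point):
    """rgb: (H, W, 3); depth: (H, W); normals: (H, W, 3);
       K: (3, 3) camera intrinsics; query_point: (3,) in camera coordinates."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)  # (H, W, 3)
    rays = pix @ np.linalg.inv(K).T          # back-projection directions
    points = rays * depth[..., None]         # 3D scene points in camera space
    offsets = points - np.asarray(query_point, dtype=float)[None, None, :]
    dist = np.linalg.norm(offsets, axis=-1, keepdims=True)
    direction = offsets / np.clip(dist, 1e-6, None)
    log_dist = np.log1p(dist)
    # Channels concatenated into the image-shaped conditioning tensor.
    return np.concatenate(
        [rgb, normals, depth[..., None], log_dist, direction], axis=-1)
```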

These architectural choices trade off bandwidth (e.g., EMG mixture compression achieves $>50\times$ data reduction (Royo et al., 2023)), real-time operation, spatial/temporal coverage, and robustness to occlusions, object/material variations, or environmental dynamics.
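The EMG compression mentioned above can be made concrete with a short sketch: a per-pixel transient profile is reconstructed from a handful of (weight, mu, sigma, lambda) tuples instead of a full time-resolved histogram, which is where the large data reduction comes from. The parameter values below are illustrative only.

```python
# Reconstruct a transient profile from an exponentially modified Gaussian
# (EMG) mixture: a few tuples summarize thousands of time bins.
import numpy as np
from scipy.special import erfc

def emg(t, mu, sigma, lam):
    """Exponentially modified Gaussian density evaluated at times t."""
    arg = (lam / 2.0) * (2.0 * mu + lam * sigma**2 - 2.0 * t)
    return (lam / 2.0) * np.exp(arg) * erfc(
        (mu + lam * sigma**2 - t) / (np.sqrt(2.0) * sigma))

def emg_mixture(t, params):
    """params: list of (weight, mu, sigma, lam) tuples; returns the mixture
       evaluated at time samples t."""
    t = np.asarray(t, dtype=float)
    return sum(w * emg(t, mu, s, lam) for (w, mu, s, lam) in params)

# Example: a 2048-bin transient summarized by 3 components (12 floats total).
profile = emg_mixture(np.linspace(0.0, 20.0, 2048),
                      [(1.0, 2.0, 0.3, 1.5),
                       (0.4, 5.0, 0.5, 0.8),
                       (0.1, 9.0, 0.8, 0.5)])
```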

4. Spatiotemporal Conditioning, Regularization, and Fusion

Accurate spatiotemporal estimation requires explicit conditioning and regularization strategies:

  • Geometric Conditioning: Embedding explicit 3D direction and distance maps (not just depth) is essential for resolving occlusion, shadowing, and per-point anisotropy. Depth alone is insufficient: ablations show severe degradation when geometric vectors are omitted (Bolduc et al., 15 Dec 2025).
  • Multi-material, Multi-exposure Fusion: Jointly leveraging mirror and diffuse probe predictions (multi-exposure) captures both high-frequency specular detail and robust low-frequency illumination/integrated radiance. Omission of either significantly worsens angular errors, especially for non-mirror materials (Bolduc et al., 15 Dec 2025).
  • Temporal Regularization: All leading methods introduce explicit temporal smoothness penalties (e.g., $\ell_1$/MSE on differences between adjacent HDRIs, T-LPIPS between frames), recurrent fusion (GRU/ConvLSTM), or adaptive inference triggers (Xihe) (Li et al., 2023, Bolduc et al., 15 Dec 2025, Zhao et al., 2021).
  • Spectral and Directional Decomposition: In measurement-based paradigms, the separation between vectorial (directional) and symmetric (diffuse) components enables quantification of “diffuseness,” essential for perceptual metrics and scientific use (Yu et al., 2022); see the decomposition sketch after this list.
  • Calibration and Aggregation: For methods aggregating temporally and spatially distributed cues, careful calibration (intrinsics, pose, orientational encoding) and ego-motion correction are indispensable. Transformer models inject angular and global positional encodings to solve for consistent sun vectors across wide-angle multi-view images (Lee et al., 2022).
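The vectorial/diffuse separation referenced above can be illustrated with a cubic-probe sketch: six readings along the positive and negative coordinate axes yield a zeroth-order (symmetric) term and a first-order light vector. The directionality ratio below is a simple proxy for illustration, not the exact diffuseness metric of the cited work.

```python
# Decompose six axis-aligned probe readings into zeroth- and first-order terms.
import numpy as np

def cubic_decomposition(E):
    """E: dict with keys '+x', '-x', '+y', '-y', '+z', '-z', each holding a
       (scalar or spectral) probe measurement."""
    E = {k: np.asarray(v, dtype=float) for k, v in E.items()}
    # Zeroth order: mean over the six faces (symmetric / diffuse part).
    e0 = sum(E.values()) / 6.0
    # First order: signed differences along each axis (light vector).
    vec = np.stack([E['+x'] - E['-x'],
                    E['+y'] - E['-y'],
                    E['+z'] - E['-z']])
    # Simple directionality proxy: light-vector magnitude relative to the mean.
    directionality = np.linalg.norm(vec, axis=0) / np.maximum(e0, 1e-9)
    return e0, vec, directionality
```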

Such conditioning and regularization underlie the state-of-the-art's ability to deliver per-point, temporally stable illumination estimates that are physically plausible and visually coherent.

5. Datasets, Evaluation Metrics, and Benchmarks

Progress has depended on large-scale, high-fidelity datasets, synthetic and real, and comprehensive quantitative protocols:

  • Data: Use of thousands of HDRI environments, multi-view camera placements, randomized probe insertions, and video sequences paired with full radiometric ground truth (e.g., 360,000 environment maps, 37,680 video sequences from OpenRooms (Li et al., 2023); 4,400 indoor/1,200 outdoor scenes in LiMo (Bolduc et al., 15 Dec 2025)).
  • Metrics:
    • Pixel-wise and perceptual: RMSE (luminance), scale-invariant RMSE, SSIM, and angular error (RGB space); see the metric sketch after this list.
    • Temporal: T-LPIPS (temporal perceptual distance), T-LPIPS-Diff (deviation from GT temporal consistency), warped RMSE using optical flow.
    • Task-driven: Relighting errors, object-render re-synthesis RMSE, task-specific success such as NLOS geometry recovery or sun vector error (median angular, mean angular).
    • Scientific/Measurement-driven: CIE-computed illuminance, chromaticity, scalar/vector/diffuseness indices (Yu et al., 2022).
  • Benchmarks and Ablations: Leading methods consistently compare against prior architecturally dissimilar or ablated baselines (e.g., 4D Lighting, DiffusionLight, SGLV with/without blending or recurrence), and report performance on synthetic, real, and mixed datasets (Bolduc et al., 15 Dec 2025, Li et al., 2023). State-of-the-art approaches demonstrate improvements in error (e.g., mirror RMSE $\approx 0.25$ vs. 0.34 for previous methods, angular error 4.4° vs. 14.7°) and temporal smoothness (T-LPIPS-Diff down to 0.0009 vs. 0.0418) (Bolduc et al., 15 Dec 2025).
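Two of the pixel-wise metrics above are simple enough to spell out; the sketch below assumes HDR predictions and ground truth given as (H, W, 3) arrays and implements scale-invariant RMSE (RMSE after a single least-squares scalar alignment) and mean per-pixel RGB angular error in degrees.

```python
# Scale-invariant RMSE and RGB angular error for (H, W, 3) HDR images.
import numpy as np

def si_rmse(pred, gt):
    """RMSE after aligning pred to gt with the least-squares optimal scalar."""
    alpha = (pred * gt).sum() / max((pred * pred).sum(), 1e-12)
    return float(np.sqrt(np.mean((alpha * pred - gt) ** 2)))

def rgb_angular_error(pred, gt, eps=1e-8):
    """Mean per-pixel angle (degrees) between predicted and GT RGB vectors."""
    dot = (pred * gt).sum(axis=-1)
    denom = np.linalg.norm(pred, axis=-1) * np.linalg.norm(gt, axis=-1) + eps
    cos = np.clip(dot / denom, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())
```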

6. Applications, Strengths, and Limitations

Spatiotemporal lighting estimation underpins a wide range of applications in vision, graphics, augmented and mixed reality, and scientific measurement:

  • Rendering and AR: Per-point, HDR illumination for photorealistic object insertion, shadow rendering, and consistent lighting in dynamically moving or changing scenes (Wei et al., 2020, Zhao et al., 2021, Li et al., 2023).
  • Scientific and Perceptual Studies: Quantitative metrics for daylight, color constancy, shadow and highlight perception, and ergonomic/comfort assessment in architecture (Yu et al., 2022).
  • Inverse Problem Solving: NLOS (non-line-of-sight) reconstruction from compressed, temporally-resolved light transport; scene relighting in uncontrolled real-world settings (Royo et al., 2023, Wei et al., 2020).
  • Strengths: Unified handling of indoor and outdoor, true 3D spatial querying, high-frequency/low-frequency detail via material fusion, explicit spectral and temporal structure, efficient real-time implementation, and robustness to noise (Bolduc et al., 15 Dec 2025, Li et al., 2023, Yu et al., 2022).
  • Limitations: Sensitivity to deep shadow or near-occluder probe placements, lack of tailored semantic priors (e.g., no specialized handling of faces), absence of NeRF-style active occlusion tracking, dependency on accurate camera calibration and depth estimates, and limited ability to hallucinate fine unseen geometry or handle dynamically moving objects and lighting (Bolduc et al., 15 Dec 2025, Li et al., 2023).

A plausible implication is that, while present methods generalize broadly and produce high-fidelity reconstructions, specialized and pointwise occlusion-aware algorithms, semantic reasoning, and deeper integration with temporally-varying geometry will further enhance accuracy and robustness in real-world deployments.

7. Future Directions

Identified research trajectories include:

  • Point-based Lighting Representations: Direct modeling of per-point, per-direction lighting capable of handling occluder proximity and localized light transport, beyond grid-based SGLV or probe-based optimization (Bolduc et al., 15 Dec 2025).
  • Semantic and Categorical Priors: Incorporation of specialized models for frequently encountered structures (faces, body parts, canonical objects) or scene classes to improve real-world accuracy (Bolduc et al., 15 Dec 2025).
  • Active Spatiotemporal Geometry Reasoning: Integration of dynamic NeRF-style or implicit surface/occluder modeling for tightly coupled lighting and scene structure estimation in dynamic environments (Bolduc et al., 15 Dec 2025).
  • Hybrid Measurement- and Learning-driven Systems: Combining portable, spectrally- and temporally-resolved measurement setups (e.g., spectral cube) with high-capacity learned priors for both scientific and applied scenes (Yu et al., 2022).
  • Scalable, Resource-efficient Pipelines: Further progress in compression (e.g., EMG mixtures (Royo et al., 2023)) and edge computation for deployment in mobile and distributed contexts (Zhao et al., 2021).
  • Handling of Dynamic Objects/Lighting: Robust temporal regularization, occlusion detection, and adaptive estimation for scenes where both geometry and lighting evolve rapidly (Li et al., 2023).

Continued progress relies on richer multi-modal datasets, cross-pollination of physical, geometric, and learning-based models, and increased emphasis on interpretability and generalization across highly variable real-world scenarios.
