
WildCap: Hybrid Inverse Rendering

Updated 19 December 2025
  • WildCap is a hybrid inverse rendering method that captures photorealistic facial reflectance from unconstrained smartphone video by combining data-driven 'delighting' with physics-based optimization.
  • It employs a texel-grid spherical harmonics lighting model to adaptively correct baked shadow artifacts and deliver artifact-free facial textures.
  • Joint optimization with a patch-level diffusion prior enforces physical consistency, achieving near studio-quality results even under complex, ambient lighting.

WildCap is a hybrid inverse rendering methodology designed for high-quality facial appearance capture from unconstrained smartphone video, narrowing the quality gap with Light-Stage studio solutions while operating in “the wild” under arbitrary illumination (Han et al., 12 Dec 2025). It integrates a data-driven “delighting” process with physics-motivated model-based optimization, introducing novel approaches to disentangle facial reflectance from confounding lighting and baked shadow artifacts. The resulting pipeline achieves photorealistic recovery of facial reflectance maps, supporting applications requiring relightable, artifact-free face reconstruction without controlled capture environments.

1. Hybrid Inverse Rendering Framework

WildCap’s workflow begins with a short (approximately 30 s) handheld smartphone video under ambient illumination. The preprocessing pipeline computes camera poses using COLMAP, acquires high-quality mesh reconstruction via 2DGS, and performs template mesh registration with Wrap3D. A subset of $V \approx 16$ well-focused frames $I_{\text{raw}}^i$ is selected for processing.

The first stage applies SwitchLight, a pretrained data-driven network, to convert each frame to a diffuse-albedo image $I^i$ simulating uniform white lighting. This step removes most specular highlights and converts input frames into a constrained pseudo-studio regime, but preserves non-physical baked shadow artifacts.

In the second stage, these SwitchLight outputs are projected into a common UV texture atlas $I_{UV} \in \mathbb{R}^{H \times W \times 3}$ using registered geometry. A physically plausible albedo map $A \in \mathbb{R}^{H \times W \times 3}$ and a local lighting model $\Gamma_\theta$ are then jointly optimized so that rendering $A$ under $\Gamma_\theta$ matches $I_{UV}$. The key innovations are:

  • A per-texel grid-based spherical harmonics (SH) lighting field, enabling spatially adaptive correction and removal of baked shadow artifacts.
  • A patch-level diffusion prior over reflectance maps (diffuse albedo, detailed normals $N_d$, specular albedo $S$), enforcing output consistency with physical facial reflectance.

Final outputs comprise 1K diffuse albedo, normal, and specular maps, a texel-grid SH lighting field for relighting, and optional 4K super-resolved outputs via RCAN.

2. Data-Driven “Delighting” with SwitchLight

The SwitchLight “delighting” network, as described by Kim et al. (2024), uses an encoder–decoder backbone with physics-guided components to model low-frequency diffuse and higher-frequency specular removal. It ingests a single portrait image (e.g., $960 \times 720$ in sRGB) under unknown, potentially complex illumination, outputting a 3-channel diffuse-albedo image $I^d$ as though illuminated by uniform white light.

SwitchLight’s output preserves geometry-driven shading and removes specularities, but it retains baked-shadow artifacts that the network cannot fully disentangle. This step is crucial, as it transforms uncontrolled illumination into a domain amenable to model-based inverse rendering, substantially reducing the optimization burden and isolating shadow artifacts as the principal remaining confounder.

3. Texel-Grid Lighting Model for Artifact Correction

WildCap introduces a texel-grid lighting model to address baked shadows unremovable by global SH or environment maps. Define $M(u, v) \in \{0, 1\}$ as a binary shadow mask in UV space indicating locations of baked shadows. Lighting is parameterized with a global SH coefficient vector $\gamma^g \in \mathbb{R}^{N_c}$ ($N_c = 27$ for second order), and a local grid $V \in \mathbb{R}^{(H/g) \times (W/g) \times N_c}$ (grid step $g = 96$).

For each UV coordinate $(u,v)$, the local SH coefficients $\gamma^v(u,v)$ are interpolated from the grid $V$ and combined with the global term:

$$\gamma(u,v) = \gamma^g + M(u,v) \cdot \gamma^v(u,v)$$

Rendering for a texel is performed using Lambertian SH shading:

$$c(u,v) = \frac{a}{\pi}\sum_{l=0}^{2} \sum_{m=-l}^{l} B_l\,\gamma_{lm}(u,v)\,Y_{lm}(n)$$

where $a = A(u,v)$ is the albedo, $n = N_c(u,v)$ is the coarse normal, $B_l$ are the SH integrals for the Lambertian BRDF, and $Y_{lm}$ are the SH basis functions.

In regions where $M = 1$, the grid allows for local “dark SH lights” that cancel baked shadows. Elsewhere, only the smooth global SH is used, preserving physically plausible low-frequency shading.
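
The following PyTorch sketch illustrates this lighting model under stated assumptions: the local grid is bilinearly upsampled to texel resolution, real second-order SH is used per color channel, and the standard Lambertian cosine-lobe convolution coefficients play the role of $B_l$. Tensor layouts and function names are illustrative, not the released implementation.

```python
import torch
import torch.nn.functional as F

def sh_basis(n):
    """Real second-order SH basis Y_lm evaluated at unit normals n: (..., 3) -> (..., 9)."""
    x, y, z = n[..., 0], n[..., 1], n[..., 2]
    return torch.stack([
        torch.full_like(x, 0.282095),                       # Y_0,0
        0.488603 * y, 0.488603 * z, 0.488603 * x,           # Y_1,-1 .. Y_1,1
        1.092548 * x * y, 1.092548 * y * z,                 # Y_2,-2, Y_2,-1
        0.315392 * (3.0 * z * z - 1.0),                     # Y_2,0
        1.092548 * x * z, 0.546274 * (x * x - y * y),       # Y_2,1, Y_2,2
    ], dim=-1)

# Standard Lambertian cosine-lobe convolution coefficients, one per SH band.
B = torch.tensor([3.141593,
                  2.094395, 2.094395, 2.094395,
                  0.785398, 0.785398, 0.785398, 0.785398, 0.785398])

def render_texel_grid_sh(A, N_c, M, gamma_g, V):
    """Lambertian SH shading with the global + masked local texel-grid lighting.

    A:       (H, W, 3)       albedo
    N_c:     (H, W, 3)       coarse unit normals
    M:       (H, W)          binary baked-shadow mask
    gamma_g: (9, 3)          global SH coefficients per color channel
    V:       (Hg, Wg, 9, 3)  local SH grid (assumed bilinearly upsampled to texels)
    """
    H, W = A.shape[:2]
    # Interpolate the coarse grid to per-texel local SH coefficients gamma^v(u, v).
    V_flat = V.reshape(1, *V.shape[:2], -1).permute(0, 3, 1, 2)        # (1, 27, Hg, Wg)
    gamma_v = F.interpolate(V_flat, size=(H, W), mode="bilinear", align_corners=True)
    gamma_v = gamma_v.permute(0, 2, 3, 1).reshape(H, W, 9, 3)

    # gamma(u, v) = gamma^g + M(u, v) * gamma^v(u, v)
    gamma = gamma_g + M[..., None, None] * gamma_v                     # (H, W, 9, 3)

    # c(u, v) = (a / pi) * sum_lm B_l * gamma_lm(u, v) * Y_lm(n)
    Y = sh_basis(N_c)                                                  # (H, W, 9)
    Bl = B.to(A)                                                       # match dtype/device
    irradiance = (Bl[:, None] * gamma * Y[..., None]).sum(dim=-2)      # (H, W, 3)
    return A / torch.pi * irradiance
```

The coarse grid resolution keeps the local correction low-frequency, consistent with the trade-off between expressivity and overfitting noted for the grid step in Section 6.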

4. Joint Optimization and Diffusion Prior Integration

The core optimization employs a photometric objective enforcing a match between the rendered output and the SwitchLight UV texture:

$$\mathcal{L}_{\text{pho}}(A, \theta) = \lVert I_{UV} - \Gamma_\theta(A, N_c) \rVert_2^2$$

Regularization comprises a total-variation term and a negativity loss:

  • Total variation on $\gamma(u,v)$:

$$\mathcal{L}_{TV} = \sum_{u,v} \left\lVert \gamma_{u,v} - \gamma_{u-1,v} \right\rVert^2 + \left\lVert \gamma_{u,v} - \gamma_{u,v-1} \right\rVert^2$$

  • Negativity loss constraining the local SH contribution $s^v$ to be non-positive, so the grid acts only as shadow (dark) light:

$$\mathcal{L}_{\text{neg}} = \sum_{u,v} \max(0, s^v_{u,v})^2$$

Combined lighting regularization:

$$\mathcal{L}_{\text{reg}}(\theta) = 0.1\,\mathcal{L}_{TV} + \mathcal{L}_{\text{neg}}$$
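
The objective and regularizers above can be written compactly as in the sketch below. It assumes the rendered texture comes from the texel-grid shading sketch in Section 3, and it interprets $s^v_{u,v}$ as the shading contributed by the local grid SH alone, which is an assumption about the notation rather than a detail confirmed by the paper.

```python
import torch

def photometric_loss(I_uv, rendered):
    """L_pho = || I_UV - Gamma_theta(A, N_c) ||_2^2, with `rendered` produced by a
    shading function such as render_texel_grid_sh above."""
    return ((I_uv - rendered) ** 2).sum()

def tv_loss(gamma):
    """Total variation over the per-texel SH field gamma of shape (H, W, 9, 3)."""
    d_u = gamma[1:, :] - gamma[:-1, :]
    d_v = gamma[:, 1:] - gamma[:, :-1]
    return (d_u ** 2).sum() + (d_v ** 2).sum()

def negativity_loss(s_v):
    """Penalize positive values of the local shading term s^v so the grid can only
    remove light, i.e. act as 'dark' shadow lights."""
    return torch.clamp(s_v, min=0.0).pow(2).sum()

def lighting_regularizer(gamma, s_v):
    # L_reg = 0.1 * L_TV + L_neg
    return 0.1 * tv_loss(gamma) + negativity_loss(s_v)
```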

A patch-level diffusion prior $\epsilon(x_t, t)$, pretrained on 48 Light-Stage scans, models the joint distribution of $(A, N_d, S)$ on $64 \times 64$ patches. During optimization, reflectance maps are sampled via diffusion posterior sampling, which incorporates photometric gradients to steer the prior, while scale ambiguity is resolved by initializing the albedo $A$ from a Light-Stage reference scan $x_0^{\text{ref}}$ matched to skin tone.

Full update equations are:

  • Reverse diffusion:

$$x'_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon(x_t, t)\right) + \sigma_t z$$

  • Clean estimate:

$$\hat{x}_t = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon(x_t, t)\right)$$

  • Posterior sampling (gradient steps on $A$ and $\theta$):

$$x_{t-1} = x'_{t-1} - \zeta_t \nabla_{x_t} \mathcal{L}_{\text{pho}}(\hat{x}_t, \theta_t)$$

$$\theta_{t-1} = \theta_t - \eta_t \nabla_{\theta_t}\left(\mathcal{L}_{\text{pho}}(\hat{x}_t, \theta_t) + \mathcal{L}_{\text{reg}}(\theta_t)\right)$$

where $\zeta_t, \eta_t$ are step-size schedules and $z \sim \mathcal{N}(0, I)$.
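
A schematic PyTorch version of one guided reverse-diffusion step, following the update equations above; the noise predictor `eps_model`, the schedule tensors, and the loss closures are assumed interfaces rather than the authors' released code.

```python
import torch

@torch.no_grad()
def guided_reverse_step(x_t, t, eps_model, alphas, alphas_bar, sigmas,
                        loss_pho, loss_reg, theta, zeta_t, eta_t):
    """One guided reverse-diffusion step with a joint lighting update.

    x_t:        current noisy reflectance patch tensor
    t:          integer timestep
    eps_model:  pretrained noise predictor epsilon(x_t, t)
    alphas, alphas_bar, sigmas: 1-D tensors of schedule values indexed by t
    loss_pho, loss_reg: closures for the photometric and lighting losses
    theta:      texel-grid SH lighting parameters, updated alongside x
    """
    a_t, ab_t = alphas[t], alphas_bar[t]

    # Guidance gradients require autograd, so re-enable it locally.
    with torch.enable_grad():
        x_t = x_t.detach().requires_grad_(True)
        theta = theta.detach().requires_grad_(True)
        eps = eps_model(x_t, t)

        # Clean estimate: x_hat_t = (x_t - sqrt(1 - ab_t) * eps) / sqrt(ab_t)
        x_hat = (x_t - torch.sqrt(1.0 - ab_t) * eps) / torch.sqrt(ab_t)

        L = loss_pho(x_hat, theta)
        grad_x, = torch.autograd.grad(L, x_t, retain_graph=True)
        grad_theta, = torch.autograd.grad(L + loss_reg(theta), theta)

    # Unconditional reverse step x'_{t-1}.
    z = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    x_prev = (x_t - (1.0 - a_t) / torch.sqrt(1.0 - ab_t) * eps) / torch.sqrt(a_t) \
             + sigmas[t] * z

    # Posterior-sampling correction on x and gradient step on the lighting.
    x_prev = x_prev - zeta_t * grad_x
    theta_new = theta.detach() - eta_t * grad_theta
    return x_prev.detach(), theta_new
```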

5. Implementation and Practical Considerations

The UV atlas is set at $H = W = 1024$ with grid step $g = 96$, yielding $V$ of size $11 \times 11 \times 27$. Diffusion runs for $T = 1000$ steps ($T_{\text{init}} = 600$), with guidance step size $\zeta_t \equiv 1$ and initial learning rate $\eta_1 = 0.01$ with exponential decay. Photometric fitting of the frames $I^i$ to the texture uses an LPIPS plus gradient loss. Maps are upsampled from 1K to 4K using RCAN in approximately eight minutes on NVIDIA RTX 4090 hardware.
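
For reference, the hyperparameters reported in this section can be gathered into a single illustrative configuration; the dictionary below is a convenience for the reader, not the authors' code.

```python
# Hyperparameters reported in this section, gathered for reference.
WILDCAP_CONFIG = {
    "uv_resolution": (1024, 1024),   # H = W = 1024 (1K UV atlas)
    "grid_step": 96,                 # texel-grid step g, giving an 11 x 11 x 27 SH grid
    "sh_coefficients": 27,           # second-order SH: 9 bands x 3 color channels
    "diffusion_steps": 1000,         # T
    "diffusion_init_step": 600,      # T_init
    "zeta": 1.0,                     # guidance step size, constant over t
    "eta_init": 0.01,                # initial lighting learning rate, exponentially decayed
    "patch_size": 64,                # patch resolution of the diffusion prior
    "num_frames": 16,                # approximate number of well-focused frames selected
    "super_resolution": "1K -> 4K via RCAN",
}
```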

Convergence is robust, terminating after $T$ steps; empirical stability is observed for modest variations in schedule parameters.

6. Evaluation Against Baselines and Prior Work

Ablation studies validate each component:

  • Omission of SwitchLight (“w/o hybrid”) degrades performance under complex illumination.
  • Removal of texel-grid lighting (“w/o TGL”) leaves persistent shadow artifacts.
  • Excluding the diffusion prior (“w/o prior”) yields physically implausible texture.

A grid step of $g = 96$ offers the best trade-off between expressivity and overfitting. Comparisons with in-the-wild baselines (DeFace [Huang et al.], FLARE [Bharadwaj et al.], and variants fed with SwitchLight outputs) show notably improved artifact removal and fidelity for WildCap.

Quantitative reconstruction metrics averaged over six subjects (PSNR and SSIM: higher is better; LPIPS: lower is better):

Method    PSNR    SSIM     LPIPS
DeFace*   22.20   0.9279   0.1192
FLARE*    27.81   0.9411   0.0929
WildCap   28.79   0.9520   0.0610
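
For context, metrics of this kind are commonly computed with scikit-image and the lpips package, as in the generic sketch below; this is not the authors' evaluation protocol, only a plausible reproduction of the metric definitions.

```python
import torch
import lpips                                    # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_net = lpips.LPIPS(net="alex")             # perceptual metric, AlexNet backbone

def evaluate_reconstruction(pred, gt):
    """pred, gt: float RGB images in [0, 1], shape (H, W, 3), as numpy arrays."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2.0 - 1.0
    lp = lpips_net(to_t(pred), to_t(gt)).item()
    return psnr, ssim, lp
```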

WildCap nearly matches DoRA [Han et al. 2025] under Light-Stage conditions, closing the gap between uncontrolled and studio capture quality.

Qualitative results highlight clean albedo free of baked shadows, retention of fine skin details, and photorealistic relighting under novel environments.

7. Limitations and Future Directions

WildCap currently depends on a closed-source SwitchLight API for the initial “delighting”; further, automatic estimation of the shadow mask $M$ via DiFaReli++ is slow and imprecise, with manual annotation delivering the best results. Residual artifacts may persist in cases of sharp shadow boundaries due to the smoothness of the SH basis.

Future work will address these limitations by developing end-to-end portrait-delighting networks that model baked-shadow uncertainty, by curating large open Light-Stage databases through applying WildCap to studio captures (as in NeRSemble), and by exploring higher-order or non-SH local basis expansions for artifact removal.

A plausible implication is that further development of open-source “delighting” and segmentation algorithms will improve WildCap’s accessibility and robustness for uncontrolled facial video acquisition.
