WildCap: Hybrid Inverse Rendering
- WildCap is a hybrid inverse rendering method that captures photorealistic facial reflectance from unconstrained smartphone video by combining data-driven 'delighting' with physics-based optimization.
- It employs a texel-grid spherical harmonics lighting model to adaptively correct baked shadow artifacts and deliver artifact-free facial textures.
- Joint optimization with a patch-level diffusion prior enforces physical consistency, achieving near studio-quality results even under complex, ambient lighting.
WildCap is a hybrid inverse rendering methodology designed for high-quality facial appearance capture from unconstrained smartphone video, narrowing the quality gap with Light-Stage studio solutions while operating in “the wild” under arbitrary illumination (Han et al., 12 Dec 2025). It integrates a data-driven “delighting” process with physics-motivated model-based optimization, introducing novel approaches to disentangle facial reflectance from confounding lighting and baked shadow artifacts. The resulting pipeline achieves photorealistic recovery of facial reflectance maps, supporting applications requiring relightable, artifact-free face reconstruction without controlled capture environments.
1. Hybrid Inverse Rendering Framework
WildCap’s workflow begins with a short (approximately 30 s) handheld smartphone video captured under ambient illumination. The preprocessing pipeline computes camera poses with COLMAP, reconstructs a high-quality mesh via 2DGS, and registers a template mesh with Wrap3D. A subset of well-focused frames is then selected for processing.
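A minimal sketch of one way to score and select well-focused frames, using the variance-of-Laplacian sharpness heuristic; the scoring function and top-k policy are illustrative assumptions, not necessarily WildCap’s actual selection criterion.

```python
# Sketch: pick well-focused frames by Laplacian-variance sharpness.
# The top_k policy here is illustrative, not WildCap's exact criterion.
import cv2
import numpy as np

def sharpness(frame_bgr: np.ndarray) -> float:
    """Variance of the Laplacian; higher means sharper."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

def select_focused_frames(video_path: str, top_k: int = 40) -> list[int]:
    """Return indices of the sharpest top_k frames, in temporal order."""
    cap = cv2.VideoCapture(video_path)
    scores, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        scores.append((sharpness(frame), idx))
        idx += 1
    cap.release()
    return sorted(i for _, i in sorted(scores, reverse=True)[:top_k])
```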
The first stage applies SwitchLight, a pretrained data-driven network, to convert each frame to a diffuse-albedo image simulating uniform white lighting. This step removes most specular highlights and converts input frames into a constrained pseudo-studio regime, but preserves non-physical baked shadow artifacts.
In the second stage, these SwitchLight outputs are projected into a common UV texture atlas using the registered geometry. A physically plausible albedo map and a local lighting model are then jointly optimized so that the albedo, rendered under the estimated lighting, matches the projected SwitchLight texture. The key innovations are:
- A per-texel grid-based spherical harmonics (SH) lighting field, enabling spatially adaptive correction and removal of baked shadow artifacts.
- A patch-level diffusion prior over reflectance maps (diffuse albedo, detailed normals, and specular albedo), enforcing output consistency with physical facial reflectance.
Final outputs comprise 1K diffuse albedo, normal, and specular maps, a texel-grid SH lighting field for relighting, and optional 4K super-resolved outputs via RCAN.
2. Data-Driven “Delighting” with SwitchLight
The SwitchLight “delighting” network, as described by Kim et al. (2024), uses an encoder–decoder backbone with physics-guided components to model removal of low-frequency diffuse shading and higher-frequency specular effects. It ingests a single sRGB portrait image under unknown, potentially complex illumination and outputs a 3-channel diffuse-albedo image as though lit by uniform white light.
SwitchLight’s output preserves geometry-driven shading while removing specularities but retains shadow-baking artifacts the network cannot disentangle fully. This step is crucial, as it transforms uncontrolled illumination into a domain amenable to model-based inverse rendering, substantially reducing the optimization burden and isolating shadow artifacts as the principal remaining confounder.
3. Texel-Grid Lighting Model for Artifact Correction
WildCap introduces a texel-grid lighting model to address baked shadows that cannot be removed by a single global SH or environment map. Let $M(u) \in \{0, 1\}$ be a binary shadow mask in UV space indicating where baked shadows occur. Lighting is parameterized by a global SH coefficient vector $\mathbf{c}^{\mathrm{glob}}$ (nine coefficients per color channel for second order) and a local grid of SH coefficients $\{\mathbf{c}^{\mathrm{loc}}_{ij}\}$ defined at a coarser grid step over the atlas.
For each UV coordinate $u$, the local SH lighting is interpolated from the grid and gated by the shadow mask:

$$\mathbf{c}(u) = \mathbf{c}^{\mathrm{glob}} + M(u)\,\operatorname{interp}\!\big(\{\mathbf{c}^{\mathrm{loc}}_{ij}\},\, u\big).$$

Rendering for a texel is performed using Lambertian SH shading:

$$I(u) = \rho(u) \sum_{l=0}^{2} \sum_{m=-l}^{l} A_l\, c_{lm}(u)\, Y_{lm}\big(n(u)\big),$$

where $\rho(u)$ is the albedo, $n(u)$ is the coarse normal, $A_l$ are the SH integrals of the Lambertian (clamped-cosine) BRDF, and $Y_{lm}$ are the SH basis functions.
In regions where $M(u) = 1$, the grid allows for local “dark SH lights” that cancel baked shadows; elsewhere, only the smooth global SH is used, preserving physically plausible low-frequency shading.
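A minimal sketch of this lighting model, assuming bilinear interpolation of the grid and the standard real SH basis with Lambertian convolution coefficients; the array shapes and function names are illustrative, not the paper’s exact parameterization.

```python
# Sketch of the texel-grid SH lighting model: a smooth global second-order SH
# plus a bilinearly interpolated per-texel grid that is only active inside the
# baked-shadow mask M, shaded with the Lambertian SH convolution.
import numpy as np

# Lambertian convolution coefficients A_l for the 9 SH terms (bands 0, 1, 2).
A_L = np.array([np.pi] + [2 * np.pi / 3] * 3 + [np.pi / 4] * 5)

def sh_basis(n: np.ndarray) -> np.ndarray:
    """Evaluate the 9 real SH basis functions at unit normals n of shape (..., 3)."""
    x, y, z = n[..., 0], n[..., 1], n[..., 2]
    return np.stack([
        0.282095 * np.ones_like(x),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3 * z ** 2 - 1),
        1.092548 * x * z, 0.546274 * (x ** 2 - y ** 2),
    ], axis=-1)

def shade(albedo, normal, mask, c_glob, c_grid):
    """
    albedo:  (H, W, 3) diffuse albedo in UV space
    normal:  (H, W, 3) unit normals in UV space
    mask:    (H, W)    binary baked-shadow mask M(u)
    c_glob:  (9, 3)    global SH coefficients
    c_grid:  (Gh, Gw, 9, 3) local SH grid (one 9x3 block per grid cell)
    returns  (H, W, 3) Lambertian SH rendering I(u)
    """
    H, W = mask.shape
    Gh, Gw = c_grid.shape[:2]
    # Bilinear interpolation of the grid at every texel.
    gy, gx = np.linspace(0, Gh - 1, H), np.linspace(0, Gw - 1, W)
    y0, x0 = np.floor(gy).astype(int), np.floor(gx).astype(int)
    y1, x1 = np.minimum(y0 + 1, Gh - 1), np.minimum(x0 + 1, Gw - 1)
    wy, wx = (gy - y0)[:, None, None, None], (gx - x0)[None, :, None, None]
    c_loc = ((1 - wy) * (1 - wx) * c_grid[y0][:, x0]
             + (1 - wy) * wx * c_grid[y0][:, x1]
             + wy * (1 - wx) * c_grid[y1][:, x0]
             + wy * wx * c_grid[y1][:, x1])              # (H, W, 9, 3)
    # The local grid only acts where baked shadows were detected.
    c = c_glob[None, None] + mask[..., None, None] * c_loc
    Y = sh_basis(normal)                                  # (H, W, 9)
    irradiance = np.einsum("hwk,k,hwkc->hwc", Y, A_L, c)
    return albedo * irradiance
```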
4. Joint Optimization and Diffusion Prior Integration
The core optimization employs a photometric objective enforcing agreement between the rendered texture and the SwitchLight UV texture $T_{\mathrm{SL}}$:

$$\mathcal{L}_{\mathrm{photo}} = \sum_{u} \big\lVert I(u) - T_{\mathrm{SL}}(u) \big\rVert^2.$$
Regularization of the local lighting grid comprises a total-variation term and a negativity term:
- Total variation on the grid, $\mathcal{L}_{\mathrm{TV}} = \sum_{ij} \lVert \mathbf{c}^{\mathrm{loc}}_{i+1,j} - \mathbf{c}^{\mathrm{loc}}_{ij} \rVert^2 + \lVert \mathbf{c}^{\mathrm{loc}}_{i,j+1} - \mathbf{c}^{\mathrm{loc}}_{ij} \rVert^2$, which keeps neighboring cells smooth.
- A negativity loss $\mathcal{L}_{\mathrm{neg}}$ that penalizes positive (brightening) contributions from the local grid, forcing it toward “shadow” SH lights that only darken.

The combined lighting regularization is $\mathcal{L}_{\mathrm{light}} = \lambda_{\mathrm{TV}}\, \mathcal{L}_{\mathrm{TV}} + \lambda_{\mathrm{neg}}\, \mathcal{L}_{\mathrm{neg}}$.
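A minimal sketch of the two regularizers, assuming the TV term compares neighboring grid cells and the negativity term penalizes positive band-0 (DC) coefficients; the exact norms and weights are not reproduced from the paper.

```python
# Sketch of the lighting regularizers: total variation over neighboring grid
# cells plus a "negativity" term so the local grid can only darken shading.
import torch

def lighting_regularizer(c_grid: torch.Tensor,
                         lambda_tv: float = 1.0,
                         lambda_neg: float = 1.0) -> torch.Tensor:
    """c_grid: (Gh, Gw, 9, 3) local SH grid coefficients."""
    # Total variation: squared differences between adjacent cells.
    tv = ((c_grid[1:, :] - c_grid[:-1, :]) ** 2).mean() \
       + ((c_grid[:, 1:] - c_grid[:, :-1]) ** 2).mean()
    # Negativity: penalize positive DC (band-0) coefficients so the local
    # lights act as shadow-cancelling "dark" lights rather than brightening.
    neg = torch.relu(c_grid[..., 0, :]).mean()
    return lambda_tv * tv + lambda_neg * neg
```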
A patch-level diffusion prior, pretrained on 48 Light-Stage scans, models the joint distribution of the reflectance maps (diffuse albedo, detailed normals, specular albedo) over local texture patches. During optimization, reflectance maps are sampled via diffusion posterior sampling, which incorporates photometric gradients to steer the prior, while scale ambiguity is resolved by initializing the albedo from a Light-Stage reference scan matched to the subject's skin tone.
The full per-iteration updates follow a standard denoising-diffusion scheme with posterior guidance:
- Reverse diffusion: $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t, t)\Big) + \sigma_t z$, with $z \sim \mathcal{N}(0, I)$.
- Clean estimate: $\hat{x}_0 = \frac{1}{\sqrt{\bar\alpha_t}}\big(x_t - \sqrt{1 - \bar\alpha_t}\, \epsilon_\theta(x_t, t)\big)$.
- Posterior sampling (gradient step on $x_{t-1}$): $x_{t-1} \leftarrow x_{t-1} - \zeta_t\, \nabla_{x_t} \mathcal{L}_{\mathrm{photo}}(\hat{x}_0)$.
- Gradient descent on lighting: $\mathbf{c} \leftarrow \mathbf{c} - \eta_t\, \nabla_{\mathbf{c}} \big(\mathcal{L}_{\mathrm{photo}} + \mathcal{L}_{\mathrm{light}}\big)$,

where $\zeta_t$ and $\eta_t$ are step-size schedules, $\alpha_t, \bar\alpha_t$ define the diffusion noise schedule, and $\epsilon_\theta$ is the pretrained patch denoiser.
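The sketch below assembles these updates into one guided reverse-diffusion step using standard DDPM/DPS forms; `epsilon_model`, `render_loss`, and the schedule tensors are hypothetical placeholders standing in for WildCap’s actual denoiser, renderer, and schedules.

```python
# Sketch of one guided reverse-diffusion step: the patch prior proposes
# x_{t-1}, a clean estimate x0_hat is formed, and a photometric gradient
# nudges both the sample and the lighting parameters.
import torch

def guided_step(x_t, t, epsilon_model, alphas, alphas_bar,
                render_loss, lighting, zeta_t, eta_t):
    """
    x_t:       (B, C, H, W) noisy reflectance patches
    t:         int timestep; alphas, alphas_bar: 1-D schedule tensors
    lighting:  leaf tensor of SH parameters with requires_grad=True
    """
    a_t, ab_t = alphas[t], alphas_bar[t]
    x_t = x_t.detach().requires_grad_(True)
    eps = epsilon_model(x_t, t)
    # Clean estimate x0_hat from the current noisy sample (DDPM identity).
    x0_hat = (x_t - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()
    # Ancestral reverse step (noise omitted at t == 0).
    mean = (x_t - (1 - a_t) / (1 - ab_t).sqrt() * eps) / a_t.sqrt()
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    x_prev = mean + (1 - a_t).sqrt() * noise
    # Photometric guidance: steer the sample toward agreement with the
    # SwitchLight texture under the current lighting.
    loss = render_loss(x0_hat, lighting)
    grad_x, = torch.autograd.grad(loss, x_t, retain_graph=True)
    x_prev = x_prev - zeta_t * grad_x
    # Alternating gradient descent on the lighting parameters.
    grad_l, = torch.autograd.grad(loss, lighting)
    with torch.no_grad():
        lighting -= eta_t * grad_l
    return x_prev.detach()
```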
5. Implementation and Practical Considerations
The UV atlas is produced at 1K resolution, with the texel-grid SH field $\{\mathbf{c}^{\mathrm{loc}}_{ij}\}$ defined at a coarser grid step over the atlas. Diffusion posterior sampling runs for a fixed number of reverse steps, with guidance step sizes $\zeta_t$, $\eta_t$ and an exponentially decaying learning rate for the lighting parameters. Photometric fitting to the texture uses an LPIPS plus image-gradient loss. Maps are upsampled from 1K to 4K using RCAN in approximately eight minutes on NVIDIA RTX 4090 hardware.
Convergence is robust, terminating after a fixed number of diffusion steps; empirical stability is observed under modest variations of the schedule parameters.
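As a rough illustration of the LPIPS-plus-gradient photometric fitting loss mentioned above, the sketch below pairs the `lpips` package with a finite-difference gradient term; the weighting and the choice of gradient operator are assumptions.

```python
# Sketch: LPIPS + image-gradient fitting loss against a target UV texture.
import torch
import lpips

_lpips = lpips.LPIPS(net="vgg")  # perceptual distance network

def fit_loss(pred: torch.Tensor, target: torch.Tensor,
             w_grad: float = 1.0) -> torch.Tensor:
    """pred, target: (B, 3, H, W) images in [0, 1]."""
    # LPIPS expects inputs scaled to [-1, 1].
    perceptual = _lpips(pred * 2 - 1, target * 2 - 1).mean()
    # Finite-difference image gradients along x and y.
    def grads(img):
        return img[..., :, 1:] - img[..., :, :-1], img[..., 1:, :] - img[..., :-1, :]
    pgx, pgy = grads(pred)
    tgx, tgy = grads(target)
    grad_term = (pgx - tgx).abs().mean() + (pgy - tgy).abs().mean()
    return perceptual + w_grad * grad_term
```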
6. Evaluation Against Baselines and Prior Work
Ablation studies validate each component:
- Omission of SwitchLight (“w/o hybrid”) degrades performance under complex illumination.
- Removal of texel-grid lighting (“w/o TGL”) leaves persistent shadow artifacts.
- Excluding the diffusion prior (“w/o prior”) yields physically implausible texture.
An intermediate grid step offers the best balance between expressivity and overfitting. Comparisons with in-the-wild baselines (DeFace [Huang et al.], FLARE [Bharadwaj et al.], and variants fed with SwitchLight outputs) show notably improved artifact removal and fidelity for WildCap.
Quantitative reconstruction metrics (averaged over six subjects, PSNR/SSIM/LPIPS):
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| DeFace* | 22.20 | 0.9279 | 0.1192 |
| FLARE* | 27.81 | 0.9411 | 0.0929 |
| WildCap | 28.79 | 0.9520 | 0.0610 |
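For reference, the three metrics in the table can be computed as in the sketch below; the library choices (scikit-image and the `lpips` package) are ours and may differ from the paper’s evaluation code.

```python
# Sketch: PSNR / SSIM / LPIPS between a prediction and ground truth in [0, 1].
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def reconstruction_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred, gt: float arrays of shape (H, W, 3) with values in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    loss_fn = lpips.LPIPS(net="alex")
    lp = float(loss_fn(to_t(pred), to_t(gt)))
    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lp}
```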
WildCap's in-the-wild results nearly match those of DoRA [Han et al. 2025] obtained under Light-Stage conditions, closing the gap between uncontrolled and studio capture quality.
Qualitative results highlight clean albedo free of baked shadows, retention of fine skin details, and photorealistic relighting under novel environments.
7. Limitations and Future Directions
WildCap currently depends on a closed-source SwitchLight API for the initial “delighting”; further, automatic estimation of the shadow mask via DiFaReli++ is slow and imprecise, with manual annotation delivering optimal performance. Residual artifacts may persist in cases of sharp shadow boundaries due to SH basis smoothness.
Future work will address these limitations by developing end-to-end portrait-delighting networks with baked-shadow uncertainty, curating large open Light-Stage databases using WildCap for studio capture processing (as in NeRSemble), and exploring higher-order or non-SH local basis expansions for artifact removal.
A plausible implication is that further development of open-source “delighting” and segmentation algorithms will improve WildCap’s accessibility and robustness for uncontrolled facial video acquisition.