Augmented Shadow Face in the Wild (ASFW)
- Augmented Shadow Face in the Wild (ASFW) is a real-world benchmark with 1,081 paired images offering accurate shadow and shadow-free comparisons for facial restoration.
- It is constructed through a bidirectional, Photoshop-based four-stage workflow that both synthesizes realistic shadows on clean faces and manually removes real shadows, bridging the gap between synthetic and real-world data.
- The accompanying FSE framework, a three-stage neural network, delivers state-of-the-art results with notable improvements in PSNR and SSIM metrics.
Augmented Shadow Face in the Wild (ASFW) constitutes the first large-scale real-world benchmark for facial shadow removal, comprising 1,081 paired images of real faces—each pair consisting of a shadowed and a pixel-aligned shadow-free version. Constructed via a professional and bidirectional manual process in Adobe Photoshop, ASFW exhibits diverse and photorealistic shadow types as well as accurate ground truth for shadow removal tasks. The dataset bridges the synthetic–real domain gap found in previous benchmarks and enables rigorous evaluation of facial shadow removal algorithms. Its utility is demonstrated through the introduction of the Face Shadow Eraser (FSE), a multi-stage neural framework that attains state-of-the-art results in both quantitative and qualitative assessment under challenging real-world conditions (Luo et al., 27 Jan 2026).
1. Dataset Composition and Construction
1.1 Paired Samples and Usage
ASFW comprises 1,081 real-world facial image pairs, each containing a shadowed photo and its meticulously aligned shadow-free counterpart. In comparative experiments, ASFW is used solely as a single, held-out test split—no explicit train/val/test partitions are defined within the dataset, emphasizing its role as an evaluation benchmark.
1.2 Photoshop-Based Four-Stage Bidirectional Workflow
ASFW is generated using a manual, bidirectional, four-stage pipeline in Adobe Photoshop:
- Shadow Synthesis: Artificial shadows are added to originally shadow-free images, with controlled brush flow (10–30%) and opacity (15–45%) settings to accurately mimic the softness and transitions of real facial shadows. Shadows are mapped in accordance with three-dimensional facial landmarks (e.g., nasal bridge, orbital areas, zygomatic arches). Shadow edges are rendered using dual diameters (hard: 5–15 px, soft: 25–50 px) with pressure-sensitive opacity, and diversity is further introduced by simulating occlusions from hair, hats, hands, and micro-shadows from skin details (wrinkles, pores).
- Shadow Removal: The shadow-free counterparts are created via lasso-based segmentation with adaptive feathering, followed by local brightness and color correction with feathered masks to eliminate halos. Edge artifacts are addressed using the Spot Healing Brush with content-aware sampling, and skin texture is restored using Content-Aware Fill, Clone Stamp, and Mixer Brush.
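The brush-flow, opacity, and feathering parameters above can be mimicked numerically. The following sketch composites a soft circular shadow with a feathered falloff — a loose numerical analogue of the manual Photoshop synthesis step, not the actual ASFW tooling; `feather` stands in for the soft brush diameter and `opacity` for the 15–45% brush opacity:

```python
import numpy as np

def feathered_shadow(face, center, radius, feather, opacity):
    """Composite a soft circular shadow onto a float image in [0, 1].

    Inside `radius` the shadow is at full strength; over the next
    `feather` pixels it fades linearly to zero, approximating a
    soft-edged brush stroke.
    """
    h, w = face.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.hypot(yy - center[0], xx - center[1])
    # Hard core inside `radius`, smooth falloff over `feather` pixels.
    mask = np.clip((radius + feather - dist) / feather, 0.0, 1.0)
    # Darken: multiply by (1 - opacity * mask), broadcast over channels.
    return face * (1.0 - opacity * mask[..., None])

face = np.full((64, 64, 3), 0.8)             # uniform "skin" patch
shadowed = feathered_shadow(face, center=(32, 32), radius=10,
                            feather=20, opacity=0.3)
```

At the shadow center the pixel value drops by the full opacity factor (0.8 → 0.56 here), while pixels beyond the feathered rim are untouched.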
1.3 Shadow Diversity and Image Attributes
The dataset encompasses a wide array of shadow phenomena, including hard versus soft edges, various occlusions (hair, hats, hands), micro-shadows from skin texture, and illumination from differing angles (frontal, side, top light); explicit statistics regarding the proportions of these shadow types are not reported. Image identities span a broad diversity of age, gender, skin tone, pose, and lighting, resulting in a challenging and realistic corpus for algorithm evaluation.
2. Face Shadow Eraser (FSE) Framework
2.1 Three-Stage Architecture
The FSE is a cascaded, lightweight, three-stage deep network for shadow removal:
- MaskGuideNet: Generates a soft shadow probability map.
- CoarseGenNet: Produces a coarse, shadow-free facial image.
- RefineFaceNet: Refines structural and photometric details, correcting fine textures and illumination.
The overall mapping from an input image $I$ (optionally with an initial mask $M_0$) to the reconstructed shadow-free image $\hat{I}$ is
$$\hat{I} = \big(f_{\text{RefineFace}} \circ f_{\text{CoarseGen}} \circ f_{\text{MaskGuide}}\big)(I \oplus M_0),$$
where “$\oplus$” denotes channel-wise concatenation (each stage's output is concatenated with the relevant inputs before being passed onward) and “$\circ$” denotes functional composition.
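As a plumbing-level illustration of the three-stage cascade and the concatenation/composition notation — not the authors' implementation — the data flow can be sketched with numpy stand-ins; `mask_guide`, `coarse_gen`, and `refine_face` below are hypothetical stubs for the learned networks:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stand-ins for the three learned stages; the real FSE uses CNN and
# transformer blocks, these stubs only exercise the cascade's plumbing.
def mask_guide(x4):                 # (H, W, 4) -> soft mask in [0, 1]
    return sigmoid(x4.mean(axis=-1, keepdims=True))

def coarse_gen(x4):                 # (H, W, 4) -> coarse RGB
    return np.clip(x4[..., :3] + 0.1 * x4[..., 3:4], 0.0, 1.0)

def refine_face(x4):                # (H, W, 4) -> refined RGB
    return np.clip(x4[..., :3], 0.0, 1.0)

def fse_forward(img, m0):
    """img: (H, W, 3) in [0, 1]; m0: (H, W, 1) optional initial mask."""
    cat = lambda *t: np.concatenate(t, axis=-1)   # channel-wise concat
    m = mask_guide(cat(img, m0))                  # stage 1: soft mask
    ic = coarse_gen(cat(img, m))                  # stage 2: coarse output
    return refine_face(cat(ic, m))                # stage 3: refinement

img = np.random.default_rng(0).random((8, 8, 3))
out = fse_forward(img, np.zeros((8, 8, 1)))
```

The key design point the sketch captures is that the predicted soft mask is re-concatenated with the image features at every subsequent stage.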
2.2 MaskGuideNet: Soft Shadow Map Generation
- Input: 4-channel tensor $I \oplus M_0$ (the RGB image concatenated with the initial mask).
- Architecture: Encoder–decoder network with Conv+ReLU residual blocks.
- Output: Soft shadow probability map $M \in [0,1]^{H \times W}$, obtained by applying a channel-wise sigmoid $\sigma(\cdot)$ to the decoder's single-channel logits.
2.3 CoarseGenNet: Coarse Shadow Removal
- Input: Concatenated $I \oplus M$ (the input image and the predicted soft shadow map).
- Architecture: Initial 3×3 Conv+ReLU for feature extraction, four AggBlock modules executing dynamic convolutions across dilation rates (e.g., dilation = {1, 2, 3}), and a final 3×3 Conv to produce the coarse output $I_c$.
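To make the role of the dilation rates concrete, here is a minimal single-channel dilated convolution in numpy — an illustrative sketch, not the AggBlock itself. The impulse response shows how dilation $d$ spreads a 3×3 kernel's taps to a $(2d+1) \times (2d+1)$ footprint without adding parameters:

```python
import numpy as np

def dilated_conv2d(x, k, dilation=1):
    """'Same'-padded single-channel 2D convolution with a dilated kernel.

    Dilation d inserts d-1 gaps between kernel taps, so a 3x3 kernel
    covers a (2d+1) x (2d+1) receptive field - the property that lets
    AggBlock aggregate context at rates {1, 2, 3}.
    """
    kh, kw = k.shape
    eff_h, eff_w = (kh - 1) * dilation + 1, (kw - 1) * dilation + 1
    ph, pw = eff_h // 2, eff_w // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(kh):              # accumulate shifted, scaled copies
        for j in range(kw):
            di, dj = i * dilation, j * dilation
            out += k[i, j] * xp[di:di + x.shape[0], dj:dj + x.shape[1]]
    return out

x = np.zeros((9, 9)); x[4, 4] = 1.0          # unit impulse
k = np.ones((3, 3))
r = dilated_conv2d(x, k, dilation=3)         # taps land 3 px apart
```

With dilation 3, the nine kernel taps land on a 7×7 grid of points spaced 3 pixels apart, centered on the impulse.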
2.4 RefineFaceNet: Structural and Photometric Refinement
- Purpose: Refines residual artifacts, leveraging both global-local context (Swin-Transformer inspired) and mask-conditioned feature modulation.
- Core Components:
- Adaptive Hierarchical Shift-Window Attention (AHSWA): Alternates regular/shifted windows, scaled dot-product attention, and depthwise convolutions.
- Illumination Refinement Component (IRC): Applies a mask-conditioned convolutional scale ($\gamma$) and bias ($\beta$) to enhance domain adaptation.
The final combine step is
$$F_{\text{out}} = \gamma \odot F + \beta,$$
where “$\odot$” is element-wise multiplication and $\gamma$, $\beta$ are the IRC's mask-conditioned scale and bias.
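The mask-conditioned modulation can be illustrated with a FiLM-style numpy sketch; the scalar projections producing γ and β below are hypothetical placeholders for the IRC's learned convolutional layers:

```python
import numpy as np

def irc_modulate(feat, mask, w_gamma, b_gamma, w_beta, b_beta):
    """Mask-conditioned feature modulation in the spirit of the IRC.

    gamma and beta are per-pixel scale/bias derived from the soft
    shadow mask; feat is an (H, W, C) feature map and the combine
    step is F_out = gamma * F + beta (element-wise).
    """
    gamma = mask * w_gamma + b_gamma          # (H, W, 1) scale
    beta = mask * w_beta + b_beta             # (H, W, 1) bias
    return gamma * feat + beta                # broadcast over channels

rng = np.random.default_rng(1)
feat = rng.random((4, 4, 8))
mask = np.zeros((4, 4, 1)); mask[1:3, 1:3] = 1.0   # shadowed region
out = irc_modulate(feat, mask, w_gamma=0.5, b_gamma=1.0,
                   w_beta=0.2, b_beta=0.0)
```

Outside the mask the features pass through unchanged (γ = 1, β = 0); inside, they are rescaled and shifted, so the correction is spatially gated by the predicted shadow.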
3. Training Objectives and Loss Functions
FSE is trained to minimize a weighted sum of three losses, with no adversarial objective; the perceptual term is LPIPS rather than a hand-crafted VGG-feature loss:
$$\mathcal{L} = \lambda_1 \mathcal{L}_{\text{MSE}} + \lambda_2 \mathcal{L}_{\text{SSIM}} + \lambda_3 \mathcal{L}_{\text{LPIPS}},$$
with scalar weights $\lambda_1$, $\lambda_2$, $\lambda_3$. Definitions:
- $\mathcal{L}_{\text{MSE}}$: pixelwise mean squared error,
- $\mathcal{L}_{\text{SSIM}}$: $1 - \mathrm{SSIM}(\hat{I}, I_{\text{gt}})$, the structural similarity index loss,
- $\mathcal{L}_{\text{LPIPS}}$: learned perceptual image patch similarity.
The sum targets both perceptual and pixel-level fidelity, promoting high-fidelity texture preservation and accurate shadow removal.
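A simplified numpy sketch of the loss composition follows. It uses a global-statistics SSIM instead of the standard windowed variant and omits the LPIPS term (which requires a pretrained network); the λ weights shown are illustrative, not the paper's values:

```python
import numpy as np

def mse_loss(pred, gt):
    return float(np.mean((pred - gt) ** 2))

def ssim_loss(pred, gt, c1=0.01**2, c2=0.03**2):
    """1 - global SSIM for single-channel images in [0, 1].

    Global statistics (no sliding window) keep the sketch short;
    production code would use a windowed SSIM.
    """
    mu_p, mu_g = pred.mean(), gt.mean()
    var_p, var_g = pred.var(), gt.var()
    cov = ((pred - mu_p) * (gt - mu_g)).mean()
    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / \
           ((mu_p**2 + mu_g**2 + c1) * (var_p + var_g + c2))
    return float(1.0 - ssim)

def total_loss(pred, gt, lam_mse=1.0, lam_ssim=1.0):
    # Illustrative weights; the LPIPS term is omitted in this sketch.
    return lam_mse * mse_loss(pred, gt) + lam_ssim * ssim_loss(pred, gt)

gt = np.linspace(0, 1, 64).reshape(8, 8)
```

A perfect reconstruction drives both terms to zero, while structural corruption (e.g., inverting the image) is penalized by both.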
4. Benchmarking: Experimental Results and Ablations
4.1 Quantitative Evaluation
Evaluations were performed on the held-out ASFW set (1,081 pairs) and the smaller UCB dataset (100 pairs), using PSNR (↑), SSIM (↑), MSE (↓), and LPIPS (↓). A summary of results on ASFW:
| Method | PSNR↑ | SSIM↑ | MSE↓ | LPIPS↓ |
|---|---|---|---|---|
| BMNet [Zhu et al. 2022] | 23.65 | 0.927 | 0.009 | 0.069 |
| FSE + ASFW | 25.45 | 0.930 | 0.006 | 0.066 |
This represents a +1.8 dB PSNR improvement over BMNet, the strongest prior system, demonstrating the effectiveness of FSE; the remaining headroom also indicates the challenge ASFW poses as a real-world benchmark.
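For reference, PSNR as reported in the table is the standard log-scaled inverse of MSE. A minimal implementation for images scaled to [0, 1]:

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    mse = np.mean((pred - gt) ** 2)
    if mse == 0:
        return float("inf")                 # identical images
    return float(10.0 * np.log10(max_val**2 / mse))

gt = np.full((16, 16), 0.5)
pred = gt + 0.01                            # uniform error -> MSE = 1e-4
```

A uniform per-pixel error of 0.01 yields an MSE of 1e-4 and hence a PSNR of 40 dB, which gives a feel for the scale of the 23–25 dB values in the table.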
4.2 Ablation Analysis
| Configuration | MaskGuide | CoarseGen | RefineFace | PSNR↑ | SSIM↑ | MSE↓ | LPIPS↓ |
|---|---|---|---|---|---|---|---|
| Full | ✓ | ✓ | ✓ | 25.45 | 0.930 | 0.006 | 0.066 |
| – MaskGuide | ✗ | ✓ | ✓ | 22.47 | 0.905 | 0.009 | 0.086 |
| – CoarseGen | ✓ | ✗ | ✓ | 21.48 | 0.864 | 0.010 | 0.126 |
| – RefineFace | ✓ | ✓ | ✗ | 22.93 | 0.886 | 0.008 | 0.112 |
The substantial performance drops from omitting any single module underscore the essential contributions of all three components. Removing CoarseGenNet causes the largest degradation (−3.97 dB PSNR), confirming the importance of the coarse-to-fine progression, while removing RefineFaceNet costs −2.52 dB, showing its necessity for high-fidelity output.
4.3 Qualitative Assessment
Visual inspection on ASFW and UCB demonstrates that FSE excels in removing strong, real-world facial shadows (e.g., across cheeks and under brows) while maintaining photorealistic detail, including pores, precise skin tone transitions, and individual hair strands. FSE outperforms alternatives (e.g., Lyu et al., FSRNet, CIRNet) consistently across diverse cases.
4.4 User Studies and Downstream Vision Tasks
No user studies or downstream vision task evaluations are reported. The primary focus is on photorealistic, high-fidelity shadow removal as measured by restoration quality.
5. Significance and Impact in Shadow Removal Research
By introducing the first large-scale, photorealistic, real-world paired facial shadow benchmark (ASFW) and a modular, lightweight, and effective neural architecture (FSE), this work sets new standards for both data resources and algorithmic solutions in high-fidelity shadow removal. ASFW enables challenging, attribute-rich evaluation previously infeasible with synthetic or small-scale datasets. The bridging of domain gaps and methodological rigor addresses longstanding deficits in both data quality and algorithm performance, facilitating progress toward production-ready facial restoration systems (Luo et al., 27 Jan 2026).