GSRA: Geometric-Semantic Rectification Attention
- The paper introduces GSRA as a cross-modal attention module that efficiently fuses geometry and semantic features to enhance shadow removal performance.
- It details a multi-stage Transformer-based design that aligns depth/normal cues with semantic embeddings via differential rectification to improve local edge precision while reducing noise.
- Empirical evaluations demonstrate that GSRA significantly boosts PSNR and SSIM in shadow removal tasks across both single-source and multi-source illumination scenarios.
Geometric-Semantic Rectification Attention (GSRA) is a cross-modal attention mechanism developed for robust shadow removal under single and multi-source illumination. Introduced within the PhaSR (Physically Aligned Shadow Removal) framework, GSRA achieves local harmonization of geometry- and semantics-derived features by differential rectification, producing feature maps that are both spatially precise at shadow edges and semantically coherent within homogeneous regions. Its design addresses the challenge of modal misalignment between geometry features (which are sensitive but noisy) and semantic embeddings (which are robust yet coarse), enabling advanced shadow removal where prior methods are limited by direct-illumination or global-texture assumptions (Lee et al., 24 Jan 2026).
1. Functional Role in the PhaSR Framework
GSRA operates in the second stage of the two-step PhaSR pipeline. Following Physically Aligned Normalization (PAN)—which executes closed-form illumination correction—GSRA is embedded within a Transformer encoder-decoder to perform local, scale-adaptive cross-modal alignment. Specifically, GSRA fuses depth and normal features extracted using DepthAnything-V2 (geometric priors) with semantic embeddings computed by DINO-v2 (semantic priors). The geometric features exhibit strong edge sensitivity but produce noise in low-gradient regions, while the semantic features provide globally stable descriptors but fail to sharply resolve illumination boundaries. By rectifying semantic attention via differential subtraction of geometry-derived attention, GSRA resolves cross-modal conflicts that arise in heterogeneous illumination settings.
2. Architectural Design and Module Flow
GSRA comprises three principal sub-components at each multi-scale decoding stage (and optionally at the deepest encoder layer):
- Prior Injection: The shared query features are augmented with geometric and semantic priors:
  $Q' = Q + \alpha F_g + \beta F_s$
  Here, $F_g$ and $F_s$ are depth/normal and semantic features, respectively, projected to a common channel dimension, and $\alpha$ and $\beta$ are learned scalars controlling the injection strength.
- Key–Value Projections: The injected features are linearly projected into keys and values for each modality $m \in \{g, s\}$:
  $K_m = F_m W_K^m, \quad V_m = F_m W_V^m$
- Differential Attention and Fusion: With shared queries $Q'$, the attention maps are
  $A_g = \mathrm{Softmax}\!\left(Q' K_g^\top / \sqrt{d}\right), \quad A_s = \mathrm{Softmax}\!\left(Q' K_s^\top / \sqrt{d}\right).$
  The rectified attention is then computed as $A_{\text{rect}} = A_s - \lambda A_g$, with $\lambda$ a learnable, possibly scale-specific scalar. Output fusion occurs as $\hat{F} = A_{\text{rect}} V_s$.
The fused output is subsequently processed by Transformer blocks (TDB/PATB) before upsampling or bypassing to higher resolutions.
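The module flow above can be sketched with a single attention head. This is a minimal NumPy illustration under stated assumptions (single head, randomly initialised projections, no residual connection or output projection, and $V_g$ left unused by the simplified fusion), not the released PhaSR implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gsra(Q, F_g, F_s, alpha=0.1, beta=0.1, lam=0.5, seed=0):
    """Single-head GSRA sketch: prior injection, modality-specific KV
    projection, differential rectification, and output fusion."""
    rng = np.random.default_rng(seed)
    n, d = Q.shape
    # Prior injection: Q' = Q + alpha * F_g + beta * F_s
    Qp = Q + alpha * F_g + beta * F_s
    # Modality-specific key/value projections (randomly initialised here)
    Wk_g, Wv_g, Wk_s, Wv_s = (rng.standard_normal((d, d)) / np.sqrt(d)
                              for _ in range(4))
    K_g, V_g = F_g @ Wk_g, F_g @ Wv_g   # geometry keys/values (V_g unused below)
    K_s, V_s = F_s @ Wk_s, F_s @ Wv_s   # semantic keys/values
    # Separate attention maps per modality, sharing the injected queries Q'
    A_g = softmax(Qp @ K_g.T / np.sqrt(d))
    A_s = softmax(Qp @ K_s.T / np.sqrt(d))
    # Differential rectification: geometry attention gates the semantic map
    A_rect = A_s - lam * A_g
    # Fuse rectified attention with the semantic values
    return A_rect @ V_s, A_rect
```

Because each softmax row sums to 1, every row of the rectified map sums to $1 - \lambda$, which makes the gating strength directly inspectable.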
3. Mathematical Formulation
The GSRA mechanism is fully governed by the following equations:
| Step | Mathematical Expression | Description |
|---|---|---|
| Prior Injection | $Q' = Q + \alpha F_g + \beta F_s$ | Incorporate geometry and semantic priors into the query stream |
| KV Projection | $K_m = F_m W_K^m,\ V_m = F_m W_V^m,\ m \in \{g, s\}$ | Modality-specific KV computation |
| Attention Maps | $A_g = \mathrm{Softmax}(Q' K_g^\top / \sqrt{d}),\ A_s = \mathrm{Softmax}(Q' K_s^\top / \sqrt{d})$ | Separate geometry and semantic attention weights |
| Differential Rect. | $A_{\text{rect}} = A_s - \lambda A_g$ | Semantic rectification by geometry gating |
| Output Fusion | $\hat{F} = A_{\text{rect}} V_s$ | Final cross-modal fused output |
No exclusive GSRA loss or explicit alignment regularizer is used; all alignment is achieved via end-to-end backpropagation of the overall reconstruction losses, namely the Charbonnier loss $\mathcal{L}_{\text{char}}$ and the SSIM loss $\mathcal{L}_{\text{SSIM}}$.
4. Mechanism for Cross-Modal Conflict Resolution
GSRA directly addresses the conflicting behaviors of the geometry and semantic streams in the attention mechanism. Geometry features, while excelling at highlighting shading and edge gradients, are prone to overreacting to any local intensity change, producing noise within untextured regions. Semantic features offer global consistency and robustness to illumination changes but fail to resolve fine-grained shadow boundaries. By learning a scale-dependent $\lambda$, GSRA adaptively subtracts geometry attention from the semantic map:
- In uniform (smooth) areas, $A_g$ is near-uniform and weak, yielding $A_{\text{rect}} \approx A_s$, so semantic features dominate and preserve smoothness.
- At true shadow boundaries, spikes in $A_g$ drive $A_{\text{rect}} = A_s - \lambda A_g$ away from the smooth semantic map, sharpening the attention response and enhancing localization of boundary features.
This selective gating allows GSRA to maintain texture integrity in interiors while retaining precise localization at transitions, resolving the modal misalignment without reliance on hand-engineered rules or handcrafted fusion heuristics.
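This gating behavior can be checked with toy numbers. The attention rows below are hypothetical, not values from the paper: a flat geometry row leaves the semantic row uniformly shifted, while a spiked geometry row produces a high-contrast rectified response exactly at the boundary position.

```python
import numpy as np

lam = 0.5
A_s = np.array([0.25, 0.25, 0.25, 0.25])  # smooth semantic attention row

# Smooth interior: geometry attention is flat and weak
A_g_smooth = np.full(4, 0.05)
rect_smooth = A_s - lam * A_g_smooth      # uniform shift; semantics dominate

# Shadow boundary: geometry attention spikes at position 2
A_g_edge = np.array([0.02, 0.02, 0.90, 0.02])
rect_edge = A_s - lam * A_g_edge          # sharp deviation at the spike
```

The rectified edge row has far larger dynamic range than the interior row, which is the "sharpened localization" behavior described above.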
5. Training Objectives and Implementation Details
GSRA is trained within PhaSR under the standard image reconstruction objectives:
- Charbonnier loss ($\mathcal{L}_{\text{char}}$) penalizes photometric differences.
- SSIM loss ($\mathcal{L}_{\text{SSIM}}$) enforces structural similarity.
No additional loss is imposed on GSRA or its alignment outputs. All module parameters, including $\alpha$, $\beta$, and $\lambda$, are optimized jointly with the rest of the framework by minimizing the total loss on shadow and reflectance reconstruction.
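A minimal sketch of the Charbonnier term above; the smoothing constant `eps` is an assumption ($10^{-3}$ is a common choice, not a value stated in the source):

```python
import numpy as np

def charbonnier(pred, target, eps=1e-3):
    """Robust L1-like photometric penalty: mean of sqrt(diff^2 + eps^2)."""
    return float(np.mean(np.sqrt((pred - target) ** 2 + eps ** 2)))
```

Unlike plain L1, the loss is differentiable at zero error, which stabilizes gradients in well-reconstructed regions.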
6. Empirical Evaluation and Quantitative Impact
Ablation studies confirm the critical role of GSRA in successful shadow removal across direct and ambient lighting. On the ISTD+ dataset, the full model (PAN+GSRA) achieves PSNR = 34.48 dB and SSIM = 0.960. Replacing GSRA with ordinary cross-attention results in a drop to PSNR = 32.56 / SSIM = 0.934 (−1.92 dB, −0.026). Setting $\lambda = 0$ (i.e., no rectification) yields intermediate results (PSNR = 32.89 / SSIM = 0.951). Ablating either prior (setting $\alpha$ or $\beta$ to zero) degrades performance by approximately 1 dB. On WSRD+ (indoor, single-light), GSRA contributes about 1.5 dB PSNR gain and 0.02 SSIM improvement over vanilla attention fusion.
When evaluated on Ambient6K—a multi-source, ambient illumination shadow normalization set—GSRA-equipped PhaSR achieves SSIM = 0.834 and PSNR = 23.32 dB, surpassing geometry-only and semantic-only models such as DenseSR (22.78 / 0.830) and OmniSR (23.01 / 0.832). Cross-dataset analysis reveals that GSRA’s explicit cross-modal rectification yields reduced degradation in domain transfer (from single-source to multi-source shadowing), underscoring its effectiveness in generalizing to complex, colored shadow scenarios.
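For reference, the PSNR figures quoted above follow the standard definition; a sketch for images scaled to $[0, 1]$ (the peak value is an assumption of that scaling, not stated in the source):

```python
import numpy as np

def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    mse = np.mean((pred - target) ** 2)
    return float(10.0 * np.log10(peak ** 2 / mse))
```

On this scale, the −1.92 dB ablation drop reported on ISTD+ corresponds to roughly a 1.56× increase in mean squared reconstruction error.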
7. Summary and Position in Current Research
GSRA is a lightweight, modular plug-in for Transformer-based image restoration architectures that resolves the inherent tension between locally sensitive geometry and globally robust semantics. By differential alignment, computing $A_{\text{rect}} = A_s - \lambda A_g$, PhaSR achieves state-of-the-art shadow removal, excelling both in direct and highly ambient lighting while maintaining lower system complexity relative to prior pipelines. The explicit mathematical formulation and empirically demonstrated benefits position GSRA as a general strategy for cross-modal attention rectification in spatially heterogeneous restoration tasks, and its efficacy has been confirmed by rigorous ablation and benchmarking protocols (Lee et al., 24 Jan 2026).