
GSRA: Geometric-Semantic Rectification Attention

Updated 31 January 2026
  • The paper introduces GSRA as a cross-modal attention module that efficiently fuses geometry and semantic features to enhance shadow removal performance.
  • It details a multi-stage Transformer-based design that aligns depth/normal cues with semantic embeddings via differential rectification to improve local edge precision while reducing noise.
  • Empirical evaluations demonstrate that GSRA significantly boosts PSNR and SSIM in shadow removal tasks across both single-source and multi-source illumination scenarios.

Geometric-Semantic Rectification Attention (GSRA) is a cross-modal attention mechanism developed for robust shadow removal under single and multi-source illumination. Introduced within the PhaSR (Physically Aligned Shadow Removal) framework, GSRA achieves local harmonization of geometry- and semantics-derived features by differential rectification, producing feature maps that are both spatially precise at shadow edges and semantically coherent within homogeneous regions. Its design addresses the challenge of modal misalignment between geometry features (which are sensitive but noisy) and semantic embeddings (which are robust yet coarse), enabling advanced shadow removal where prior methods are limited by direct-illumination or global-texture assumptions (Lee et al., 24 Jan 2026).

1. Functional Role in the PhaSR Framework

GSRA operates in the second stage of the two-step PhaSR pipeline. Following Physically Aligned Normalization (PAN)—which executes closed-form illumination correction—GSRA is embedded within a Transformer encoder-decoder to perform local, scale-adaptive cross-modal alignment. Specifically, GSRA fuses depth and normal features extracted using DepthAnything-V2 (geometric priors) with semantic embeddings computed by DINO-v2 (semantic priors). The geometric features exhibit strong edge sensitivity but produce noise in low-gradient regions, while the semantic features provide globally stable descriptors but fail to sharply resolve illumination boundaries. By rectifying semantic attention via differential subtraction of geometry-derived attention, GSRA resolves cross-modal conflicts that arise in heterogeneous illumination settings.
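The two-stage flow can be summarized in a minimal sketch; every callable here (`pan`, `depth_normal_encoder`, `semantic_encoder`, `restorer`) is a hypothetical stand-in for the corresponding PhaSR component, not actual code from the paper:

```python
import numpy as np

def phasr_pipeline(shadow_img, pan, depth_normal_encoder, semantic_encoder, restorer):
    """Two-stage PhaSR flow: closed-form PAN first, then a GSRA-equipped
    Transformer restorer that fuses geometric and semantic priors.

    All four callables are illustrative placeholders: `pan` stands in for
    Physically Aligned Normalization, `depth_normal_encoder` for the
    DepthAnything-V2 priors, and `semantic_encoder` for DINO-v2 embeddings.
    """
    normalized = pan(shadow_img)               # stage 1: illumination correction
    f_geo = depth_normal_encoder(shadow_img)   # edge-sensitive but noisy priors
    f_sem = semantic_encoder(shadow_img)       # coarse but illumination-robust
    return restorer(normalized, f_geo, f_sem)  # stage 2: GSRA fusion inside
```
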

2. Architectural Design and Module Flow

GSRA comprises three principal sub-components at each multi-scale decoding stage (and optionally at the deepest encoder layer):

  1. Prior Injection: The shared query features $F_{\text{input}} \in \mathbb{R}^{H \times W \times C}$ are augmented with geometric and semantic priors:
    • $F_{\text{geo}}' = F_{\text{input}} + \alpha_{\text{geo}} F_{\text{geo}}$
    • $F_{\text{sem}}' = F_{\text{input}} + \alpha_{\text{sem}} F_{\text{sem}}$
    Here, $F_{\text{geo}}, F_{\text{sem}}$ are depth/normal and semantic features, respectively, projected to a common channel dimension, and $\alpha_{\text{geo}}, \alpha_{\text{sem}}$ are learned scalars controlling the injection strength.
  2. Key–Value Projections: The injected features are linearly projected into keys and values for each modality:
    • $\{K_{\text{geo}}, V_{\text{geo}}\} = \mathcal{F}_{\text{geo}}(F_{\text{geo}}')$
    • $\{K_{\text{sem}}, V_{\text{sem}}\} = \mathcal{F}_{\text{sem}}(F_{\text{sem}}')$
  3. Differential Attention and Fusion:
    • With shared queries $Q = \text{Linear}_q(F_{\text{input}})$, the attention maps are:
      • $A_{\text{geo}} = \text{Softmax}\big((Q K_{\text{geo}}^\top) / \sqrt{d} + B\big)$
      • $A_{\text{sem}} = \text{Softmax}\big((Q K_{\text{sem}}^\top) / \sqrt{d} + B\big)$
    • The rectified attention is then computed as $A_{\text{rect}} = A_{\text{sem}} - \lambda A_{\text{geo}}$, with $\lambda$ a learnable, possibly scale-specific scalar.
    • Output fusion occurs as $F_{\text{out}} = \text{Concat}(A_{\text{rect}} V_{\text{geo}},\ A_{\text{rect}} V_{\text{sem}})$.

The fused output is subsequently processed by Transformer blocks (TDB/PATB) before upsampling or bypassing to higher resolutions.
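The module flow above can be sketched as a toy single-head implementation in NumPy. The weight shapes, scalar values, and single-head simplification are illustrative assumptions, not the paper's actual architecture (which embeds multi-head GSRA inside a Transformer encoder-decoder):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gsra(F_input, F_geo, F_sem, W_q, W_kg, W_vg, W_ks, W_vs,
         alpha_geo=0.1, alpha_sem=0.1, lam=0.5, B=0.0):
    """Toy single-head GSRA over N = H*W tokens of channel dim C."""
    d = W_q.shape[1]
    # 1. Prior injection into the shared query stream
    F_geo_p = F_input + alpha_geo * F_geo
    F_sem_p = F_input + alpha_sem * F_sem
    # 2. Shared queries, modality-specific key-value projections
    Q = F_input @ W_q
    K_geo, V_geo = F_geo_p @ W_kg, F_geo_p @ W_vg
    K_sem, V_sem = F_sem_p @ W_ks, F_sem_p @ W_vs
    # 3. Differential rectification: subtract geometry attention from semantic
    A_geo = softmax(Q @ K_geo.T / np.sqrt(d) + B)
    A_sem = softmax(Q @ K_sem.T / np.sqrt(d) + B)
    A_rect = A_sem - lam * A_geo
    # Fuse by applying the rectified attention to both value streams
    return np.concatenate([A_rect @ V_geo, A_rect @ V_sem], axis=-1)
```

With N tokens and channel dimension C, the fused output has shape (N, 2C), matching the concatenation of the two attended value streams.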

3. Mathematical Formulation

The GSRA mechanism is fully governed by the following equations:

| Step | Mathematical Expression | Description |
|---|---|---|
| Prior Injection | $F_{\text{geo}}' = F_{\text{input}} + \alpha_{\text{geo}} F_{\text{geo}}$; $F_{\text{sem}}' = F_{\text{input}} + \alpha_{\text{sem}} F_{\text{sem}}$ | Incorporate geometry and semantic priors into the query stream |
| KV Projection | $\{K_{\text{geo}}, V_{\text{geo}}\} = \mathcal{F}_{\text{geo}}(F_{\text{geo}}')$; $\{K_{\text{sem}}, V_{\text{sem}}\} = \mathcal{F}_{\text{sem}}(F_{\text{sem}}')$ | Modality-specific key–value computation |
| Attention Maps | $A_{\text{geo}} = \text{Softmax}\big(\frac{Q K_{\text{geo}}^\top}{\sqrt{d}} + B\big)$; $A_{\text{sem}} = \text{Softmax}\big(\frac{Q K_{\text{sem}}^\top}{\sqrt{d}} + B\big)$ | Separate geometry and semantic attention weights |
| Differential Rectification | $A_{\text{rect}} = A_{\text{sem}} - \lambda A_{\text{geo}}$ | Semantic rectification by geometry gating |
| Output Fusion | $F_{\text{out}} = \text{Concat}(A_{\text{rect}} V_{\text{geo}},\ A_{\text{rect}} V_{\text{sem}})$ | Final cross-modal fused output |

No exclusive GSRA loss or explicit alignment regularizer is used; all alignment is achieved via end-to-end backpropagation of the overall reconstruction losses, namely $\mathcal{L}_{\text{total}} = \lambda_C \mathcal{L}_{\text{Charb}} + \lambda_S \mathcal{L}_{\text{SSIM}}$.

4. Mechanism for Cross-Modal Conflict Resolution

GSRA directly addresses the conflicting behaviors of the geometry and semantic streams in the attention mechanism. Geometry features, while excelling at highlighting shading and edge gradients, are prone to overreacting to any local intensity change, producing noise within untextured regions. Semantic features offer global consistency and robustness to illumination changes but fail to resolve fine-grained shadow boundaries. By learning a scale-dependent $\lambda$, GSRA adaptively subtracts geometry attention from the semantic map:

  • In uniform (smooth) areas, $A_{\text{geo}}$ is minimal, yielding $A_{\text{rect}} \approx A_{\text{sem}}$; semantic features dominate and preserve smoothness.
  • At true shadow boundaries, spikes in $A_{\text{geo}}$ sharpen $A_{\text{rect}}$, enhancing localization of boundary features.

This selective gating allows GSRA to maintain texture integrity in interiors while retaining precise localization at transitions, resolving the modal misalignment without reliance on hand-engineered rules or handcrafted fusion heuristics.
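The gating behavior described above can be illustrated with toy attention rows; all numbers below are fabricated purely for illustration:

```python
import numpy as np

lam = 0.5  # illustrative rectification strength

# One attention row (5 keys) in a smooth region: geometry attention is
# near-uniform, so subtracting it shifts the row by a constant and leaves
# the semantic ordering intact.
A_sem_smooth = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
A_geo_smooth = np.full(5, 0.2)                 # no edges -> flat response
rect_smooth = A_sem_smooth - lam * A_geo_smooth

# At a shadow boundary: geometry attention spikes on the edge key (index 2),
# so rectification increases the contrast between edge and non-edge keys.
A_sem_edge = np.array([0.18, 0.22, 0.26, 0.20, 0.14])  # semantics are coarse
A_geo_edge = np.array([0.05, 0.10, 0.70, 0.10, 0.05])  # sharp edge response
rect_edge = A_sem_edge - lam * A_geo_edge
```

In the smooth case the rectified row preserves the semantic peak; in the edge case the spread of attention weights widens, which is the sharpening effect the differential term provides.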

5. Training Objectives and Implementation Details

GSRA is trained within PhaSR under the standard image reconstruction objectives:

  • Charbonnier loss ($\mathcal{L}_{\text{Charb}}$) penalizes photometric differences.
  • SSIM loss ($\mathcal{L}_{\text{SSIM}}$) enforces structural similarity.

No additional loss is imposed on GSRA or its alignment outputs. All module parameters, including $\alpha_{\text{geo}}$, $\alpha_{\text{sem}}$, and $\lambda$, are optimized jointly with the rest of the framework by minimizing the total loss on shadow and reflectance reconstruction.
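A minimal sketch of the combined objective follows. The Charbonnier $\epsilon$, the global single-window SSIM simplification (real SSIM losses use local sliding windows), the use of $1 - \text{SSIM}$ as the SSIM loss term, and the weights `lam_c`, `lam_s` are all assumptions, not the paper's reported settings:

```python
import numpy as np

def charbonnier(pred, target, eps=1e-3):
    """Charbonnier loss: a smooth, robust variant of the L1 penalty."""
    return np.mean(np.sqrt((pred - target) ** 2 + eps ** 2))

def ssim_global(pred, target, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global (single-window) SSIM over the whole image, for illustration."""
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    cov = ((pred - mu_p) * (target - mu_t)).mean()
    return ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / \
           ((mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2))

def total_loss(pred, target, lam_c=1.0, lam_s=0.5):
    # L_total = lam_C * L_Charb + lam_S * (1 - SSIM); weights are illustrative.
    return lam_c * charbonnier(pred, target) + \
           lam_s * (1.0 - ssim_global(pred, target))
```

For identical images the Charbonnier term collapses to roughly $\epsilon$ and the SSIM term to zero, so the total loss approaches zero as expected.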

6. Empirical Evaluation and Quantitative Impact

Ablation studies confirm the critical role of GSRA in successful shadow removal across direct and ambient lighting. On the ISTD+ dataset, the full model (PAN+GSRA) achieves PSNR = 34.48 dB and SSIM = 0.960. Replacing GSRA with ordinary cross-attention results in a drop to PSNR = 32.56 / SSIM = 0.934 (−1.92 dB, −0.026). Setting $\lambda = 0$ (i.e., no rectification) yields intermediate results (PSNR = 32.89 / SSIM = 0.951). Ablating either prior (setting $\alpha_{\text{geo}}$ or $\alpha_{\text{sem}}$ to zero) degrades performance by approximately 1 dB. On WSRD+ (indoor, single-light), GSRA contributes about a 1.5 dB PSNR gain and a 0.02 SSIM improvement over vanilla attention fusion.

When evaluated on Ambient6K—a multi-source, ambient illumination shadow normalization set—GSRA-equipped PhaSR achieves SSIM = 0.834 and PSNR = 23.32 dB, surpassing geometry-only and semantic-only models such as DenseSR (22.78 / 0.830) and OmniSR (23.01 / 0.832). Cross-dataset analysis reveals that GSRA’s explicit cross-modal rectification yields reduced degradation in domain transfer (from single-source to multi-source shadowing), underscoring its effectiveness in generalizing to complex, colored shadow scenarios.

7. Summary and Position in Current Research

GSRA is a lightweight, modular plug-in for Transformer-based image restoration architectures that resolves the inherent tension between locally sensitive geometry and globally robust semantics. By differential alignment, computing $A_{\text{rect}} = A_{\text{sem}} - \lambda A_{\text{geo}}$, PhaSR achieves state-of-the-art shadow removal, excelling under both direct and highly ambient lighting while maintaining lower system complexity relative to prior pipelines. The explicit mathematical formulation and empirically demonstrated benefits position GSRA as a general strategy for cross-modal attention rectification in spatially heterogeneous restoration tasks, and its efficacy has been confirmed by rigorous ablation and benchmarking protocols (Lee et al., 24 Jan 2026).
