
GSRA: Geometric-Semantic Rectification Attention

Updated 31 January 2026
  • The paper introduces GSRA as a cross-modal attention module that efficiently fuses geometry and semantic features to enhance shadow removal performance.
  • It details a multi-stage Transformer-based design that aligns depth/normal cues with semantic embeddings via differential rectification to improve local edge precision while reducing noise.
  • Empirical evaluations demonstrate that GSRA significantly boosts PSNR and SSIM in shadow removal tasks across both single-source and multi-source illumination scenarios.

Geometric-Semantic Rectification Attention (GSRA) is a cross-modal attention mechanism developed for robust shadow removal under single and multi-source illumination. Introduced within the PhaSR (Physically Aligned Shadow Removal) framework, GSRA achieves local harmonization of geometry- and semantics-derived features by differential rectification, producing feature maps that are both spatially precise at shadow edges and semantically coherent within homogeneous regions. Its design addresses the challenge of modal misalignment between geometry features (which are sensitive but noisy) and semantic embeddings (which are robust yet coarse), enabling advanced shadow removal where prior methods are limited by direct-illumination or global-texture assumptions (Lee et al., 24 Jan 2026).

1. Functional Role in the PhaSR Framework

GSRA operates in the second stage of the two-step PhaSR pipeline. Following Physically Aligned Normalization (PAN)—which executes closed-form illumination correction—GSRA is embedded within a Transformer encoder-decoder to perform local, scale-adaptive cross-modal alignment. Specifically, GSRA fuses depth and normal features extracted using DepthAnything-V2 (geometric priors) with semantic embeddings computed by DINO-v2 (semantic priors). The geometric features exhibit strong edge sensitivity but produce noise in low-gradient regions, while the semantic features provide globally stable descriptors but fail to sharply resolve illumination boundaries. By rectifying semantic attention via differential subtraction of geometry-derived attention, GSRA resolves cross-modal conflicts that arise in heterogeneous illumination settings.
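The two-stage flow can be summarized in a minimal sketch; every callable here (`pan`, `depth_normal_encoder`, `semantic_encoder`, `restorer`) is a hypothetical stand-in for the corresponding PhaSR component, not actual code from the paper:

```python
import numpy as np

def phasr_pipeline(shadow_img, pan, depth_normal_encoder, semantic_encoder, restorer):
    """Two-stage PhaSR flow: closed-form PAN first, then a GSRA-equipped
    Transformer restorer that fuses geometric and semantic priors.

    All four callables are illustrative placeholders: `pan` stands in for
    Physically Aligned Normalization, `depth_normal_encoder` for the
    DepthAnything-V2 priors, and `semantic_encoder` for DINO-v2 embeddings.
    """
    normalized = pan(shadow_img)               # stage 1: illumination correction
    f_geo = depth_normal_encoder(shadow_img)   # edge-sensitive but noisy priors
    f_sem = semantic_encoder(shadow_img)       # coarse but illumination-robust
    return restorer(normalized, f_geo, f_sem)  # stage 2: GSRA fusion inside
```
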

2. Architectural Design and Module Flow

GSRA comprises three principal sub-components at each multi-scale decoding stage (and optionally at the deepest encoder layer):

  1. Prior Injection: The shared query features $F_{\text{input}} \in \mathbb{R}^{H \times W \times C}$ are augmented with geometric and semantic priors:
    • $F_{\text{geo}}' = F_{\text{input}} + \alpha_{\text{geo}} F_{\text{geo}}$
    • $F_{\text{sem}}' = F_{\text{input}} + \alpha_{\text{sem}} F_{\text{sem}}$
    Here, $F_{\text{geo}}, F_{\text{sem}}$ are depth/normal and semantic features, respectively, projected to a common channel dimension, and $\alpha_{\text{geo}}, \alpha_{\text{sem}}$ are learned scalars controlling the injection strength.
  2. Key–Value Projections: The injected features are linearly projected into keys and values for each modality:
    • $\{K_{\text{geo}}, V_{\text{geo}}\} = \mathcal{F}_{\text{geo}}(F_{\text{geo}}')$
    • $\{K_{\text{sem}}, V_{\text{sem}}\} = \mathcal{F}_{\text{sem}}(F_{\text{sem}}')$
  3. Differential Attention and Fusion:
    • With shared queries $Q = \text{Linear}_q(F_{\text{input}})$, the attention maps are:
      • $A_{\text{geo}} = \text{Softmax}\big((Q K_{\text{geo}}^\top) / \sqrt{d} + B\big)$
      • $A_{\text{sem}} = \text{Softmax}\big((Q K_{\text{sem}}^\top) / \sqrt{d} + B\big)$
    • The rectified attention is then computed as $A_{\text{rect}} = A_{\text{sem}} - \lambda A_{\text{geo}}$, with $\lambda$ a learnable, possibly scale-specific scalar.
    • Output fusion occurs as $F_{\text{out}} = \text{Concat}(A_{\text{rect}} V_{\text{geo}},\ A_{\text{rect}} V_{\text{sem}})$.

The fused output is subsequently processed by Transformer blocks (TDB/PATB) before upsampling or bypassing to higher resolutions.
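The module flow above can be sketched as a toy single-head implementation in NumPy. The weight shapes, scalar values, and single-head simplification are illustrative assumptions, not the paper's actual architecture (which embeds multi-head GSRA inside a Transformer encoder-decoder):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gsra(F_input, F_geo, F_sem, W_q, W_kg, W_vg, W_ks, W_vs,
         alpha_geo=0.1, alpha_sem=0.1, lam=0.5, B=0.0):
    """Toy single-head GSRA over N = H*W tokens of channel dim C."""
    d = W_q.shape[1]
    # 1. Prior injection into the shared query stream
    F_geo_p = F_input + alpha_geo * F_geo
    F_sem_p = F_input + alpha_sem * F_sem
    # 2. Shared queries, modality-specific key-value projections
    Q = F_input @ W_q
    K_geo, V_geo = F_geo_p @ W_kg, F_geo_p @ W_vg
    K_sem, V_sem = F_sem_p @ W_ks, F_sem_p @ W_vs
    # 3. Differential rectification: subtract geometry attention from semantic
    A_geo = softmax(Q @ K_geo.T / np.sqrt(d) + B)
    A_sem = softmax(Q @ K_sem.T / np.sqrt(d) + B)
    A_rect = A_sem - lam * A_geo
    # Fuse by applying the rectified attention to both value streams
    return np.concatenate([A_rect @ V_geo, A_rect @ V_sem], axis=-1)
```

With N tokens and channel dimension C, the fused output has shape (N, 2C), matching the concatenation of the two attended value streams.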

3. Mathematical Formulation

The GSRA mechanism is fully governed by the following equations:

| Step | Mathematical Expression | Description |
|---|---|---|
| Prior Injection | $F_{\text{geo}}' = F_{\text{input}} + \alpha_{\text{geo}} F_{\text{geo}}$; $F_{\text{sem}}' = F_{\text{input}} + \alpha_{\text{sem}} F_{\text{sem}}$ | Incorporate geometry and semantic priors into the query stream |
| KV Projection | $\{K_{\text{geo}}, V_{\text{geo}}\} = \mathcal{F}_{\text{geo}}(F_{\text{geo}}')$; $\{K_{\text{sem}}, V_{\text{sem}}\} = \mathcal{F}_{\text{sem}}(F_{\text{sem}}')$ | Modality-specific key–value computation |
| Attention Maps | $A_{\text{geo}} = \text{Softmax}\big(\frac{Q K_{\text{geo}}^\top}{\sqrt{d}} + B\big)$; $A_{\text{sem}} = \text{Softmax}\big(\frac{Q K_{\text{sem}}^\top}{\sqrt{d}} + B\big)$ | Separate geometry and semantic attention weights |
| Differential Rectification | $A_{\text{rect}} = A_{\text{sem}} - \lambda A_{\text{geo}}$ | Semantic rectification by geometry gating |
| Output Fusion | $F_{\text{out}} = \text{Concat}(A_{\text{rect}} V_{\text{geo}},\ A_{\text{rect}} V_{\text{sem}})$ | Final cross-modal fused output |

No exclusive GSRA loss or explicit alignment regularizer is used; all alignment is achieved via end-to-end backpropagation of the overall reconstruction losses, namely $\mathcal{L}_{\text{total}} = \lambda_C \mathcal{L}_{\text{Charb}} + \lambda_S \mathcal{L}_{\text{SSIM}}$.

4. Mechanism for Cross-Modal Conflict Resolution

GSRA directly addresses the conflicting behaviors of the geometry and semantic streams in the attention mechanism. Geometry features, while excelling at highlighting shading and edge gradients, are prone to overreacting to any local intensity change, producing noise within untextured regions. Semantic features offer global consistency and robustness to illumination changes but fail to resolve fine-grained shadow boundaries. By learning a scale-dependent $\lambda$, GSRA adaptively subtracts geometry attention from the semantic map:

  • In uniform (smooth) areas, $A_{\text{geo}}$ is minimal, yielding $A_{\text{rect}} \approx A_{\text{sem}}$; semantic features dominate and preserve smoothness.
  • At true shadow boundaries, spikes in $A_{\text{geo}}$ sharpen $A_{\text{rect}}$, enhancing localization of boundary features.

This selective gating allows GSRA to maintain texture integrity in interiors while retaining precise localization at transitions, resolving the modal misalignment without reliance on hand-engineered rules or handcrafted fusion heuristics.
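The gating behavior described above can be illustrated with toy attention rows; all numbers below are fabricated purely for illustration:

```python
import numpy as np

lam = 0.5  # illustrative rectification strength

# One attention row (5 keys) in a smooth region: geometry attention is
# near-uniform, so subtracting it shifts the row by a constant and leaves
# the semantic ordering intact.
A_sem_smooth = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
A_geo_smooth = np.full(5, 0.2)                 # no edges -> flat response
rect_smooth = A_sem_smooth - lam * A_geo_smooth

# At a shadow boundary: geometry attention spikes on the edge key (index 2),
# so rectification increases the contrast between edge and non-edge keys.
A_sem_edge = np.array([0.18, 0.22, 0.26, 0.20, 0.14])  # semantics are coarse
A_geo_edge = np.array([0.05, 0.10, 0.70, 0.10, 0.05])  # sharp edge response
rect_edge = A_sem_edge - lam * A_geo_edge
```

In the smooth case the rectified row preserves the semantic peak; in the edge case the spread of attention weights widens, which is the sharpening effect the differential term provides.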

5. Training Objectives and Implementation Details

GSRA is trained within PhaSR under the standard image reconstruction objectives:

  • Charbonnier loss ($\mathcal{L}_{\text{Charb}}$) penalizes photometric differences.
  • SSIM loss ($\mathcal{L}_{\text{SSIM}}$) enforces structural similarity.

No additional loss is imposed on GSRA or its alignment outputs. All module parameters, including $\alpha_{\text{geo}}$, $\alpha_{\text{sem}}$, and $\lambda$, are optimized jointly with the rest of the framework by minimizing the total loss on shadow and reflectance reconstruction.
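A minimal sketch of the combined objective follows. The Charbonnier $\epsilon$, the global single-window SSIM simplification (real SSIM losses use local sliding windows), the use of $1 - \text{SSIM}$ as the SSIM loss term, and the weights `lam_c`, `lam_s` are all assumptions, not the paper's reported settings:

```python
import numpy as np

def charbonnier(pred, target, eps=1e-3):
    """Charbonnier loss: a smooth, robust variant of the L1 penalty."""
    return np.mean(np.sqrt((pred - target) ** 2 + eps ** 2))

def ssim_global(pred, target, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global (single-window) SSIM over the whole image, for illustration."""
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    cov = ((pred - mu_p) * (target - mu_t)).mean()
    return ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / \
           ((mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2))

def total_loss(pred, target, lam_c=1.0, lam_s=0.5):
    # L_total = lam_C * L_Charb + lam_S * (1 - SSIM); weights are illustrative.
    return lam_c * charbonnier(pred, target) + \
           lam_s * (1.0 - ssim_global(pred, target))
```

For identical images the Charbonnier term collapses to roughly $\epsilon$ and the SSIM term to zero, so the total loss approaches zero as expected.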

6. Empirical Evaluation and Quantitative Impact

Ablation studies confirm the critical role of GSRA in successful shadow removal across direct and ambient lighting. On the ISTD+ dataset, the full model (PAN+GSRA) achieves PSNR = 34.48 dB and SSIM = 0.960. Replacing GSRA with ordinary cross-attention results in a drop to PSNR = 32.56 / SSIM = 0.934 (−1.92 dB, −0.026). Setting $\lambda = 0$ (i.e., no rectification) yields intermediate results (PSNR = 32.89 / SSIM = 0.951). Ablating either prior (setting $\alpha_{\text{geo}}$ or $\alpha_{\text{sem}}$ to zero) degrades performance by approximately 1 dB. On WSRD+ (indoor, single-light), GSRA contributes about a 1.5 dB PSNR gain and a 0.02 SSIM improvement over vanilla attention fusion.

When evaluated on Ambient6K—a multi-source, ambient illumination shadow normalization set—GSRA-equipped PhaSR achieves SSIM = 0.834 and PSNR = 23.32 dB, surpassing geometry-only and semantic-only models such as DenseSR (22.78 / 0.830) and OmniSR (23.01 / 0.832). Cross-dataset analysis reveals that GSRA’s explicit cross-modal rectification yields reduced degradation in domain transfer (from single-source to multi-source shadowing), underscoring its effectiveness in generalizing to complex, colored shadow scenarios.

7. Summary and Position in Current Research

GSRA is a lightweight, modular plug-in for Transformer-based image restoration architectures that resolves the inherent tension between locally sensitive geometry and globally robust semantics. By differential alignment, computing $A_{\text{rect}} = A_{\text{sem}} - \lambda A_{\text{geo}}$, PhaSR achieves state-of-the-art shadow removal, excelling under both direct and highly ambient lighting while maintaining lower system complexity relative to prior pipelines. The explicit mathematical formulation and empirically demonstrated benefits position GSRA as a general strategy for cross-modal attention rectification in spatially heterogeneous restoration tasks, and its efficacy has been confirmed by rigorous ablation and benchmarking protocols (Lee et al., 24 Jan 2026).
