Residual Feature Aggregator (RFA)
- Residual Feature Aggregator (RFA) is a fusion module that combines low-resolution structural details with high-frequency reference textures.
- It uses channel-wise fusion, Swin Transformer layers, and a residual pathway within a U-Net-style architecture to integrate multi-scale context.
- Empirical results demonstrate that employing RFA yields improved PSNR/SSIM compared to conventional convolution-only fusion approaches.
The Residual Feature Aggregator (RFA) is a feature-fusion module in the multi-scale deformable attention transformer (DATSR) architecture for reference-based image super-resolution (RefSR). RFA integrates local structure from low-resolution (LR) images and non-local texture details transferred from reference imagery via deformable attention, employing channel-wise fusion, Swin Transformer blocks for long-range context, and a residual pathway. This design enables adaptive blending of information at every stage in a U-Net-style encoder–decoder pipeline, preserving the robustness of LR features while maximizing contextual correspondence for high-fidelity super-resolution outputs (Cao et al., 2022).
1. Role in Multi-Scale DATSR Architecture
Within DATSR, RFA operates at each scale between the encoder and decoder branches of the U-Net. Two parallel streams reach RFA at every scale $l$: the LR feature map $F_l$ and the transferred texture feature map $A_l$, produced by the reference-based deformable attention (RDA) module. RFA fuses these into a single feature map $F_{l+1}$ for the subsequent stage, so that both spatial structure (from LR) and fine textures (from references) are preserved and contextually integrated.
At the terminal scale $l = L$, RFA’s output is sent to the reconstruction head; in parallel, a global skip connection adds the bicubic upsampling of the LR input, ensuring faithful super-resolution recovery.
2. Mathematical Formulation
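For LR features $F_l$ and transferred texture features $A_l$, RFA’s operation at each scale $l$ is:

$$F'_l = \mathrm{ReLU}\big(\mathrm{Conv}_1([F_l; A_l])\big), \qquad F_{l+1} = \mathrm{Conv}_2\big(\mathrm{STL}(F'_l) + F'_l\big)$$

Here, $[\cdot\,;\cdot]$ denotes channel-wise concatenation; $\mathrm{Conv}_1$ and $\mathrm{Conv}_2$ are convolutions (the first followed by ReLU); $\mathrm{STL}$ is a stack of Swin Transformer layers; and the outer skip connection ensures preservation of fused features. At the final stage:

$$I_{SR} = \mathrm{Rec}(F_{L+1}) + I_{LR\uparrow}$$

where $F_{L+1}$ is the RFA output at the terminal scale $L$, $\mathrm{Rec}$ is the reconstruction head, and $I_{LR\uparrow}$ is the bicubically upsampled LR input.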
3. Internal Layer Composition and Implementation
RFA consists of the following sequential blocks at each scale:
- Conv 1: convolution (stride 1) with ReLU, fusing the concatenated LR and attention features.
- Swin Transformer Stack: windowed multi-head self-attention blocks with MLP, residuals, and LayerNorm, enabling long-range interactions across the feature spatial domain.
- Residual Addition: Output of Conv 1 ($F'_l$) is added to the Swin stack output ($\mathrm{STL}(F'_l)$).
- Conv 2: convolution (stride 1; no activation), producing $F_{l+1}$ for subsequent stages.
Pseudocode for the RFA layer: (1) `x = ReLU(Conv1(concat(F_l, A_l)))`; (2) `y = SwinStack(x) + x`; (3) `F_next = Conv2(y)`.
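As a minimal sketch of the block sequence above, the NumPy snippet below wires the four steps together. It is an illustration under stated assumptions, not the authors' implementation: both convolutions are written as 1×1 (pointwise) convolutions for brevity, the names `conv1x1`, `rfa_forward`, and `params` are hypothetical, and the Swin Transformer stack is abstracted as an arbitrary shape-preserving callable `context_fn`.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """Pointwise convolution on a (C_in, H, W) map; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def rfa_forward(f_lr, a_ref, params, context_fn):
    """Fuse LR features with transferred reference features.

    context_fn stands in for the Swin Transformer stack (assumption): any
    callable mapping a (C, H, W) array to an array of the same shape.
    """
    fused = np.concatenate([f_lr, a_ref], axis=0)                 # channel-wise concat
    f_prime = np.maximum(conv1x1(fused, *params["conv1"]), 0.0)   # Conv 1 + ReLU
    mixed = context_fn(f_prime) + f_prime                         # outer residual
    return conv1x1(mixed, *params["conv2"])                       # Conv 2, no activation

# Toy usage with random weights and a context function that returns zeros,
# so only the residual path carries information.
rng = np.random.default_rng(0)
C, H, W = 8, 16, 16
params = {
    "conv1": (rng.normal(size=(C, 2 * C)) * 0.1, np.zeros(C)),
    "conv2": (rng.normal(size=(C, C)) * 0.1, np.zeros(C)),
}
f_lr = rng.normal(size=(C, H, W))
a_ref = rng.normal(size=(C, H, W))
out = rfa_forward(f_lr, a_ref, params, context_fn=np.zeros_like)
print(out.shape)
```

With a zero-valued `context_fn`, the output reduces to Conv 2 applied to the fused features, which shows how the residual path keeps the fused signal available regardless of what the attention stack produces.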
4. Design Rationale
The architectural motivation arises from the complementary natures of the two input streams:
- The LR path ($F_l$) conveys global spatial structure and semantic integrity.
- The transferred texture feature ($A_l$) injects fine detail from reference imagery.
While simple fusion by concatenation and convolution can capture local correlations, it cannot exploit the long-range dependencies needed to match and blend reference textures. Including Swin Transformer blocks (windowed self-attention) inside RFA enables modeling of non-local context, facilitating context-aware merging of structure and detail.
The outer residual pathway $\mathrm{STL}(F'_l) + F'_l$, where $F'_l$ is the Conv 1 output and $\mathrm{STL}$ the Swin stack, safeguards against transformer-induced artifacts (e.g., “hallucinated” textures) by always preserving the fused local features. This distinguishes RFA from standard conv-only fusion or ResNet blocks by ensuring both robustness and improved global consistency at each multi-scale step.
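To make the windowed self-attention idea concrete, the following NumPy sketch restricts single-head attention to non-overlapping windows. It is a simplified illustration rather than a Swin Transformer layer: it omits multi-head projection, shifted windows, relative position bias, LayerNorm, and the MLP, and all function and parameter names are hypothetical.

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) map into (num_windows, ws*ws, C) token groups."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)

def window_reverse(windows, ws, H, W):
    """Inverse of window_partition: rebuild the (H, W, C) map."""
    C = windows.shape[-1]
    x = windows.reshape(H // ws, W // ws, ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(x, wq, wk, wv, ws):
    """Single-head self-attention inside each ws x ws window (H, W divisible by ws)."""
    H, W, C = x.shape
    tokens = window_partition(x, ws)                         # (nW, ws*ws, C)
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C))    # (nW, ws*ws, ws*ws)
    return window_reverse(attn @ v, ws, H, W)

# Toy usage: an 8x8 map with 4 channels, split into four 4x4 windows.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))
wq, wk, wv = (rng.normal(size=(4, 4)) for _ in range(3))
y = window_attention(x, wq, wk, wv, ws=4)
print(y.shape)
```

Each token attends only to the `ws * ws` tokens in its own window, so the cost grows linearly with the number of windows instead of quadratically with the number of pixels. Swin layers alternate regular and shifted windows so that information can cross window borders over successive layers.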
5. Quantitative Ablations and Empirical Impact
Systematic ablations validate the crucial contribution of RFA, particularly the Swin Transformer sub-block. Table 6 summarizes CUFED5-valid PSNR/SSIM for key variants under identical training.
| Variant | PSNR (dB) | SSIM |
|---|---|---|
| RDA w/ ordinary feature-warping only | 28.25 | 0.844 |
| RFA w/ plain ResNet blocks | 28.50 | 0.850 |
| Full DATSR (RDA + RFA) | 28.72 | 0.856 |
Replacing RFA’s Swin Transformer layers with 3×3 convolutional ResNet blocks lowers PSNR by 0.22 dB (28.72 → 28.50). The full pipeline with deformable attention and transformer-based RFA gains 0.47 dB PSNR over simple feature warping (28.25 → 28.72), indicating that RFA contributes materially to super-resolution performance. This suggests that windowed self-attention within RFA achieves context-dependent fusion that convolution-only structures do not match in this setting (Cao et al., 2022).
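The gaps quoted above follow directly from the table. The short script below recomputes them and, as a rough aid to intuition, converts a PSNR difference into the corresponding MSE ratio using the definition PSNR = 10·log10(MAX²/MSE). That conversion holds exactly for a single image and only approximately for dataset-averaged PSNR; the helper name `mse_ratio` is hypothetical.

```python
# PSNR (dB) and SSIM values copied from the ablation table.
results = {
    "RDA w/ ordinary feature-warping only": (28.25, 0.844),
    "RFA w/ plain ResNet blocks": (28.50, 0.850),
    "Full DATSR (RDA + RFA)": (28.72, 0.856),
}

full_psnr, full_ssim = results["Full DATSR (RDA + RFA)"]

# Gain of the full model over each variant.
deltas = {
    name: (full_psnr - psnr, full_ssim - ssim)
    for name, (psnr, ssim) in results.items()
}

def mse_ratio(delta_db):
    """MSE of the better model divided by MSE of the worse one, for a PSNR gain in dB."""
    return 10 ** (-delta_db / 10)

for name, (d_psnr, d_ssim) in deltas.items():
    print(f"{name}: dPSNR = {d_psnr:+.2f} dB, dSSIM = {d_ssim:+.3f}, "
          f"MSE ratio = {mse_ratio(d_psnr):.3f}")
```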
6. Integration and Adaptation in Multi-Scale Super-Resolution
RFA is instantiated at each encoder–decoder scale, recursively blending progressively upsampled LR features and reference textures as resolution increases. This multi-scale aggregation adapts texture blending across spatial resolutions, accounting for scale variance in reference–target mismatch and allowing incremental refinement of fine detail. The final output combines the most contextually consistent and visually plausible information across all scales, together with a global skip to guarantee structural alignment with the input.
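The multi-scale flow described above can be sketched as a loop over scales followed by the global skip. This is a schematic under stated assumptions rather than the DATSR code: features are assumed to double in resolution between consecutive scales, nearest-neighbour upsampling stands in for both the learned upsampling and the bicubic skip, and `multi_scale_sr`, `rfa_fns`, and `head_fn` are hypothetical names.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) array (stand-in for bicubic)."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def multi_scale_sr(lr_image, lr_feat, ref_feats, rfa_fns, head_fn, scale):
    """Run one fusion step per scale, then add the global skip.

    ref_feats[l] : transferred texture features at scale l (same shape as feat)
    rfa_fns[l]   : callable (feat, ref) -> fused features of the same shape
    head_fn      : callable mapping the final features to an image
    """
    feat = lr_feat
    last = len(rfa_fns) - 1
    for level, (ref, rfa) in enumerate(zip(ref_feats, rfa_fns)):
        feat = rfa(feat, ref)                      # fuse structure and texture
        if level < last:
            feat = upsample_nearest(feat, 2)       # move to the next, finer scale
    return head_fn(feat) + upsample_nearest(lr_image, scale)   # global skip

# Toy usage: three scales (1x, 2x, 4x) with additive fusion as a placeholder.
C, H, W = 4, 5, 6
lr_image = np.ones((3, H, W))
lr_feat = np.zeros((C, H, W))
ref_feats = [np.zeros((C, H * 2**l, W * 2**l)) for l in range(3)]
rfa_fns = [lambda f, r: f + r] * 3
sr = multi_scale_sr(lr_image, lr_feat, ref_feats, rfa_fns, lambda f: f[:3], scale=4)
print(sr.shape)
```

When the fused features are all zero, the output equals the upsampled LR input, which illustrates that the reconstruction head only needs to supply the detail missing from the upsampled image.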
A plausible implication is that RFA’s architecture would generalize to other multi-stream, multi-scale fusion problems in vision where robustness, context, and preservation of backbone features are all necessary. Its lightweight design (using a combination of small convs and windowed transformers rather than full self-attention) offers a tractable approach for high-resolution tasks where memory and speed are important.