Residual Feature Aggregator (RFA)

Updated 31 May 2026
  • Residual Feature Aggregator (RFA) is a fusion module that combines low-resolution structural details with high-frequency reference textures.
  • It uses channel-wise fusion, Swin Transformer layers, and a residual pathway within a U-Net-style architecture to integrate multi-scale context.
  • Empirical results demonstrate that employing RFA yields improved PSNR/SSIM compared to conventional convolution-only fusion approaches.

The Residual Feature Aggregator (RFA) is a feature-fusion module in the multi-scale deformable attention transformer (DATSR) architecture for reference-based image super-resolution (RefSR). RFA integrates local structure from low-resolution (LR) images and non-local texture details transferred from reference imagery via deformable attention, employing channel-wise fusion, Swin Transformer blocks for long-range context, and a residual pathway. This design enables adaptive blending of information at every stage in a U-Net-style encoder–decoder pipeline, preserving the robustness of LR features while maximizing contextual correspondence for high-fidelity super-resolution outputs (Cao et al., 2022).

1. Role in Multi-Scale DATSR Architecture

Within DATSR, RFA operates at each scale $l \in \{1,2,\dots,L\}$ between the encoder and decoder branches of the U-Net. Two parallel streams reach RFA at every scale: the LR feature map $\mathbf{F}_l \in \mathbb{R}^{C \times H_l \times W_l}$ and the transferred texture feature map $\mathbf{A}_l \in \mathbb{R}^{C \times H_l \times W_l}$ produced by the reference-based deformable attention (RDA) module. RFA fuses these into a single feature map $\mathbf{F}_{l+1}$ for the subsequent stage, so that both spatial structure (from LR) and fine textures (from references) are preserved and contextually integrated.

At the terminal scale $L$, RFA’s output $\mathbf{F}_L$ is sent to the reconstruction head; in parallel, a global skip connection adds the bicubic upsampling of the LR input, ensuring faithful $4\times$ super-resolution recovery.

2. Mathematical Formulation

For each scale $l$, RFA computes:

$$\begin{aligned} \mathbf{U}_l &= \mathrm{Conv}_1\bigl(\mathbf{F}_l \oplus \mathbf{A}_l\bigr), \\ \mathbf{V}_l &= \mathrm{STL}(\mathbf{U}_l), \\ \mathbf{Z}_l &= \mathbf{U}_l + \mathbf{V}_l, \\ \mathbf{F}_{l+1} &= \mathrm{Conv}_2(\mathbf{Z}_l). \end{aligned}$$

Here, $\oplus$ denotes channel-wise concatenation; $\mathrm{Conv}_1$ and $\mathrm{Conv}_2$ are convolutions (the first followed by ReLU); $\mathrm{STL}$ is a stack of Swin Transformer layers; and the outer skip connection $\mathbf{U}_l + \mathbf{V}_l$ preserves the fused features. At the final stage, the output of the reconstruction head applied to $\mathbf{F}_L$ is added to the bicubically upsampled LR input.

3. Internal Layer Composition and Implementation

RFA consists of the following sequential blocks at each scale:

  • Conv 1: convolution (stride 1) with ReLU, applied to the $2C$-channel concatenation of LR and attention features to fuse them.
  • Swin Transformer Stack: windowed multi-head self-attention blocks with MLP, residuals, and LayerNorm, enabling long-range interactions across the feature spatial domain.
  • Residual Addition: Output of Conv 1 ($\mathbf{U}_l$) is added to the Swin stack output ($\mathbf{V}_l$).
  • Conv 2: convolution (stride 1; no activation), producing $\mathbf{F}_{l+1}$ for subsequent stages.

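In pseudocode terms, the RFA layer is a straight-line composition of these four blocks: concatenate $\mathbf{F}_l$ and $\mathbf{A}_l$ along channels, apply Conv 1 with ReLU, run the Swin stack, add the Conv 1 output back, and apply Conv 2.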
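The four steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the DATSR implementation: it assumes pointwise (1×1) convolutions without bias purely for brevity, and it takes the Swin stack as a caller-supplied function `stl`. The names `conv1x1`, `relu`, `add`, and `rfa` are hypothetical. Feature maps are nested lists indexed `[channel][row][col]`.

```python
def conv1x1(x, weight):
    """Pointwise convolution: mix channels independently at every pixel.

    x is indexed [c_in][row][col]; weight is indexed [c_out][c_in].
    ASSUMPTION: a 1x1 kernel without bias, chosen only to keep the sketch short.
    """
    rows, cols = len(x[0]), len(x[0][0])
    return [
        [[sum(wk * x[k][i][j] for k, wk in enumerate(w_row)) for j in range(cols)]
         for i in range(rows)]
        for w_row in weight
    ]


def relu(x):
    """Element-wise max(0, v)."""
    return [[[max(0.0, v) for v in row] for row in ch] for ch in x]


def add(x, y):
    """Element-wise sum of two feature maps with identical shape."""
    return [[[a + b for a, b in zip(rx, ry)] for rx, ry in zip(cx, cy)]
            for cx, cy in zip(x, y)]


def rfa(f_l, a_l, w1, w2, stl):
    """One RFA step: U = ReLU(Conv1(F (+) A)), V = STL(U), Z = U + V, out = Conv2(Z)."""
    u = relu(conv1x1(f_l + a_l, w1))  # list + list concatenates along channels
    v = stl(u)                        # stand-in for the Swin Transformer stack
    z = add(u, v)                     # outer residual pathway
    return conv1x1(z, w2)


# Toy inputs: C = 2 channels, H = 1, W = 2.
f_l = [[[1.0, 2.0]], [[3.0, 4.0]]]
a_l = [[[0.5, 0.5]], [[-1.0, 1.0]]]
w1 = [[1, 0, 1, 0], [0, 1, 0, 1]]  # Conv 1 sums matching F and A channels
w2 = [[1, 0], [0, 1]]              # Conv 2 acts as the identity


def zero_stl(u):
    """A Swin stand-in that contributes nothing, to isolate the residual path."""
    return [[[0.0 for _ in row] for row in ch] for ch in u]


out = rfa(f_l, a_l, w1, w2, zero_stl)
print(out)
```

With `zero_stl`, the output equals the fused Conv 1 features, which illustrates the role of the residual pathway: whatever the transformer stack contributes, the locally fused features still reach Conv 2.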

4. Design Rationale

The architectural motivation arises from the complementary natures of the two input streams:

  • The LR path ($\mathbf{F}_l$) conveys global spatial structure and semantic integrity.
  • The transferred texture feature ($\mathbf{A}_l$) injects fine detail from reference imagery.

While simple fusion by concatenation and convolution can capture local correlations, it cannot exploit the long-range dependencies that are vital for matching and blending reference textures. Including Swin Transformer blocks (windowed self-attention) inside RFA enables modeling of non-local context, facilitating context-aware merging of structure and detail.

The outer residual pathway $\mathbf{Z}_l = \mathbf{U}_l + \mathbf{V}_l$ safeguards against transformer-induced artifacts (e.g., “hallucinated” textures), always preserving fused local features. This distinguishes RFA from standard conv-only fusion or ResNet blocks by ensuring both robustness and improved global consistency at each multi-scale step.

5. Quantitative Ablations and Empirical Impact

Systematic ablations validate the crucial contribution of RFA, particularly the Swin Transformer sub-block. Table 6 summarizes CUFED5-valid PSNR/SSIM for key variants under identical training.

| Variant | PSNR (dB) | SSIM |
|---|---|---|
| RDA w/ ordinary feature-warping only | 28.25 | 0.844 |
| RFA w/ plain ResNet blocks | 28.50 | 0.850 |
| Full DATSR (RDA + RFA) | 28.72 | 0.856 |

Replacing RFA’s Swin Transformer layers with 3×3 convolutional ResNet blocks yields a 0.22 dB drop in PSNR (28.72 to 28.50). The full pipeline with deformable attention and transformer-based RFA gains 0.47 dB PSNR over simple feature warping (28.72 vs. 28.25), making RFA a significant contributor to super-resolution performance. This suggests that windowed self-attention within RFA realizes context-dependent fusion that convolution-only structures do not match in this ablation (Cao et al., 2022).
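The gaps between variants follow directly from the tabulated values. The short script below recomputes them; the dictionary keys and the helper `delta` are hypothetical names, and the numbers are copied from the ablation table in this section.

```python
# (PSNR in dB, SSIM) per variant, copied from the ablation table.
results = {
    "warping_only": (28.25, 0.844),  # RDA w/ ordinary feature-warping only
    "resnet_rfa": (28.50, 0.850),    # RFA w/ plain ResNet blocks
    "full_datsr": (28.72, 0.856),    # Full DATSR (RDA + RFA)
}


def delta(better, worse):
    """Return (PSNR gain in dB, SSIM gain) of `better` over `worse`, rounded."""
    psnr_b, ssim_b = results[better]
    psnr_w, ssim_w = results[worse]
    return round(psnr_b - psnr_w, 2), round(ssim_b - ssim_w, 3)


swin_vs_resnet = delta("full_datsr", "resnet_rfa")
full_vs_warping = delta("full_datsr", "warping_only")
print(swin_vs_resnet)   # (0.22, 0.006)
print(full_vs_warping)  # (0.47, 0.012)
```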

6. Integration and Adaptation in Multi-Scale Super-Resolution

RFA is instantiated at each encoder–decoder scale, recursively blending progressively upsampled LR features and reference textures as resolution increases. This multi-scale aggregation adapts texture blending across spatial resolutions, accounting for scale variance in reference–target mismatch and allowing incremental refinement of fine detail. The final output combines the most contextually consistent and visually plausible information across all scales, together with a global skip to guarantee structural alignment with the input.
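The per-scale data flow can be traced with a shape-level dry run. This is a sketch under assumptions the text does not confirm: upsampling is placed after every scale except the last, each step doubles the resolution, and there are three scales. The helpers `fuse_across_scales`, `rfa_shape`, and `upsample2x` are hypothetical and track only `(height, width)` tuples, not real features.

```python
def fuse_across_scales(f, textures, rfa, upsample):
    """Apply `rfa` at each scale, upsampling the result between scales.

    ASSUMPTION: upsampling happens after every scale except the last.
    `textures` holds the transferred texture features A_l, coarse to fine.
    """
    last = len(textures) - 1
    for l, a_l in enumerate(textures):
        f = rfa(f, a_l)
        if l < last:
            f = upsample(f)
    return f


def rfa_shape(f, a):
    """Shape-only stand-in for RFA: both inputs must share a spatial size."""
    if f != a:
        raise ValueError(f"scale mismatch: F is {f}, A is {a}")
    return f


def upsample2x(f):
    """Shape-only stand-in for a 2x upsampling step (hypothetical factor)."""
    return (f[0] * 2, f[1] * 2)


# Hypothetical sizes: a 40x40 LR feature map and textures at 1x, 2x, 4x.
textures = [(40, 40), (80, 80), (160, 160)]
final = fuse_across_scales((40, 40), textures, rfa_shape, upsample2x)
print(final)  # (160, 160)
```

The check inside `rfa_shape` reflects the requirement that $\mathbf{F}_l$ and $\mathbf{A}_l$ share the same spatial size at every scale before they can be concatenated.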

A plausible implication is that RFA’s architecture would generalize to other multi-stream, multi-scale fusion problems in vision where robustness, context, and preservation of backbone features are all necessary. Its lightweight design (using a combination of small convs and windowed transformers rather than full self-attention) offers a tractable approach for high-resolution tasks where memory and speed are important.
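The cost advantage of windowed attention can be made concrete with the standard estimates from the Swin Transformer paper: global multi-head self-attention on an $h \times w$ map with $C$ channels costs about $4hwC^2 + 2(hw)^2C$ operations, while attention restricted to $M \times M$ windows costs about $4hwC^2 + 2M^2hwC$. The feature-map size, channel count, and window size below are hypothetical values chosen for illustration and are not taken from DATSR.

```python
def msa_cost(h, w, c):
    """Approximate cost of global multi-head self-attention (quadratic in h*w)."""
    return 4 * h * w * c**2 + 2 * (h * w) ** 2 * c


def wmsa_cost(h, w, c, m):
    """Approximate cost of window attention with m x m windows (linear in h*w)."""
    return 4 * h * w * c**2 + 2 * m**2 * h * w * c


# Hypothetical setting: 160x160 feature map, 64 channels, 8x8 windows.
h = w = 160
c = 64
m = 8

ratio = msa_cost(h, w, c) / wmsa_cost(h, w, c, m)
print(ratio)  # 134.0
```

Because the window term grows linearly with the number of pixels while the global term grows quadratically, the gap widens as resolution increases, which is the regime that matters at the finest super-resolution scale.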
