Residual Feature Aggregator (RFA)

Updated 31 May 2026

Residual Feature Aggregator (RFA) is a fusion module that combines low-resolution structural details with high-frequency reference textures.
It uses channel-wise fusion, Swin Transformer layers, and a residual pathway within a U-Net-style architecture to integrate multi-scale context.
Empirical results demonstrate that employing RFA yields improved PSNR/SSIM compared to conventional convolution-only fusion approaches.

The Residual Feature Aggregator (RFA) is a feature-fusion module in the multi-scale deformable attention transformer (DATSR) architecture for reference-based image super-resolution (RefSR). RFA integrates local structure from low-resolution (LR) images and non-local texture details transferred from reference imagery via deformable attention, employing channel-wise fusion, Swin Transformer blocks for long-range context, and a residual pathway. This design enables adaptive blending of information at every stage in a U-Net-style encoder–decoder pipeline, preserving the robustness of LR features while maximizing contextual correspondence for high-fidelity super-resolution outputs (Cao et al., 2022).

1. Role in Multi-Scale DATSR Architecture

Within DATSR, RFA operates at each scale $l \in \{1,2,\dots,L\}$ between the encoder and decoder branches of the U-Net. Two parallel streams reach RFA at every scale: the LR feature map $\mathbf{F}_l \in \mathbb{R}^{C \times H_l \times W_l}$, and the transferred texture feature map $\mathbf{A}_l \in \mathbb{R}^{C \times H_l \times W_l}$ produced by the reference-based deformable attention module. RFA fuses these into a single feature map $\mathbf{F}_{l+1}$ for the subsequent stage, ensuring both spatial structure (from LR) and fine textures (from references) are preserved and contextually integrated.

At the terminal scale $L$, RFA’s output $\mathbf{F}_L$ is sent to the reconstruction head; in parallel, a global skip connection adds the bicubic upsampling of the LR input, ensuring faithful $4\times$ super-resolution recovery.
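The global skip connection can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions: `upsample_nearest` uses nearest-neighbour repetition as a simple stand-in for bicubic upsampling, and `head` is a hypothetical placeholder for the reconstruction head, whose actual layers are not described here.

```python
import numpy as np

def upsample_nearest(img, scale):
    """Nearest-neighbour upsampling of a (C, H, W) array.
    Stand-in for bicubic upsampling (an assumption made for brevity)."""
    return img.repeat(scale, axis=1).repeat(scale, axis=2)

def reconstruct(f_last, lr_image, head, scale=4):
    """SR output = head(final RFA features) + upsampled LR input (global skip)."""
    return head(f_last) + upsample_nearest(lr_image, scale)

lr = np.arange(12, dtype=float).reshape(3, 2, 2)  # toy 3-channel 2x2 LR image
f_last = np.zeros((16, 8, 8))                     # toy final-scale feature map

def head(features):
    """Hypothetical head that predicts no residual detail at all."""
    return np.zeros((3, features.shape[1], features.shape[2]))

sr = reconstruct(f_last, lr, head)
print(sr.shape)
```

With a head that outputs zeros, the result is just the upsampled LR image, which shows that the learned branch only has to supply the detail missing from the upsampled input.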

2. Mathematical Formulation

RFA’s operation at each scale is defined as: $\begin{aligned} \mathbf{U}_l &= \mathrm{Conv}_1\bigl(\mathbf{F}_l \oplus \mathbf{A}_l\bigr), \\ \mathbf{V}_l &= \mathrm{STL}(\mathbf{U}_l), \\ \mathbf{Z}_l &= \mathbf{U}_l + \mathbf{V}_l, \\ \mathbf{F}_{l+1} &= \mathrm{Conv}_2(\mathbf{Z}_l). \end{aligned}$ Here, $\oplus$ denotes channel-wise concatenation; $\mathrm{Conv}_1$ and $\mathrm{Conv}_2$ are convolutions (the first followed by ReLU); $\mathrm{STL}$ is a stack of Swin Transformer layers; and the outer skip connection ensures preservation of fused features. At the final stage: $\mathbf{I}_{SR} = \mathrm{Rec}(\mathbf{F}_L) + \mathbf{I}_{LR}^{\uparrow}$, where $\mathrm{Rec}$ is the reconstruction head and $\mathbf{I}_{LR}^{\uparrow}$ is the bicubically upsampled LR input.
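The four equations above can be traced step by step in a short NumPy sketch. Several choices here are assumptions made for illustration and are not taken from the source: both convolutions are modelled as pointwise (1×1) linear maps, the intermediate width is set to $C$, the weights are random, and `stl` is any caller-supplied function standing in for the Swin Transformer stack.

```python
import numpy as np

def pointwise_conv(x, weight, bias):
    """1x1 convolution on a (C_in, H, W) array: a per-pixel linear map.
    weight has shape (C_out, C_in); bias has shape (C_out,)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def rfa_forward(f_l, a_l, w1, b1, w2, b2, stl):
    """One RFA step; `stl` maps a (C, H, W) array to a (C, H, W) array."""
    fused = np.concatenate([f_l, a_l], axis=0)          # F_l (+) A_l: (2C, H, W)
    u = np.maximum(pointwise_conv(fused, w1, b1), 0.0)  # U_l: Conv_1 then ReLU
    v = stl(u)                                          # V_l = STL(U_l)
    z = u + v                                           # Z_l = U_l + V_l
    return pointwise_conv(z, w2, b2)                    # F_{l+1} = Conv_2(Z_l)

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
f_l = rng.standard_normal((C, H, W))   # LR feature stream
a_l = rng.standard_normal((C, H, W))   # transferred texture stream
w1, b1 = rng.standard_normal((C, 2 * C)) * 0.1, np.zeros(C)
w2, b2 = rng.standard_normal((C, C)) * 0.1, np.zeros(C)

# Placeholder Swin stack that outputs zeros, so Z_l reduces to U_l.
f_next = rfa_forward(f_l, a_l, w1, b1, w2, b2, stl=np.zeros_like)
print(f_next.shape)
```

Passing a different callable as `stl` changes only the $\mathbf{V}_l$ term; the concatenation, residual sum, and output convolution stay the same.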

3. Internal Layer Composition and Implementation

RFA consists of the following sequential blocks at each scale:

Conv 1: convolution (stride 1) with ReLU, taking the $2C$-channel concatenation of LR and attention features and fusing it.
Swin Transformer Stack: a stack of windowed multi-head self-attention blocks with MLP, residuals, and LayerNorm, enabling long-range interactions across the feature spatial domain.
Residual Addition: Output of Conv 1 ($\mathbf{U}_l$) is added to the Swin stack output ($\mathbf{V}_l$).
Conv 2: convolution (stride 1; no activation), producing the $C$-channel $\mathbf{F}_{l+1}$ for subsequent stages.

Pseudocode for the RFA layer:

    U = ReLU(Conv1(concat(F_l, A_l)))
    V = STL(U)
    Z = U + V
    F_next = Conv2(Z)
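The windowed self-attention at the core of the Swin Transformer stack can be sketched in NumPy as follows. This is a deliberately stripped-down illustration, not the DATSR implementation: it uses a single head with no learned query/key/value projections, and it omits shifted windows, relative position bias, the MLP, and LayerNorm. The window size `m` and all function names are hypothetical.

```python
import numpy as np

def window_partition(x, m):
    """Split a (C, H, W) map into non-overlapping m x m windows.
    Returns an array of shape (num_windows, m*m, C)."""
    c, h, w = x.shape
    assert h % m == 0 and w % m == 0, "H and W must be multiples of m"
    x = x.reshape(c, h // m, m, w // m, m)
    x = x.transpose(1, 3, 2, 4, 0)            # (H/m, W/m, m, m, C)
    return x.reshape(-1, m * m, c)

def window_reverse(windows, m, h, w):
    """Inverse of window_partition: back to a (C, H, W) map."""
    c = windows.shape[-1]
    x = windows.reshape(h // m, w // m, m, m, c)
    x = x.transpose(4, 0, 2, 1, 3)            # (C, H/m, m, W/m, m)
    return x.reshape(c, h, w)

def window_self_attention(x, m):
    """Scaled dot-product self-attention computed inside each window."""
    c, h, w = x.shape
    tokens = window_partition(x, m)                           # (nW, m*m, C)
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(c)  # (nW, m*m, m*m)
    scores -= scores.max(axis=-1, keepdims=True)              # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return window_reverse(weights @ tokens, m, h, w)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
y = window_self_attention(x, m=4)

# Perturb one pixel in the top-left window and see which outputs move.
x2 = x.copy()
x2[:, 0, 0] += 1.0
y2 = window_self_attention(x2, m=4)
changed = np.abs(y2 - y).sum(axis=0) > 1e-12   # (H, W) mask of affected pixels
print(changed[:4, :4].any(), changed[4:, :].any(), changed[:4, 4:].any())
```

In this simplified form, a change inside one window affects only that window's outputs. Swin Transformer layers alternate regular and shifted window layouts so that information can also propagate across window boundaries over successive layers.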

4. Design Rationale

The architectural motivation arises from the complementary natures of the two input streams:

The LR path ($\mathbf{F}_l$) conveys global spatial structure and semantic integrity.
The transferred texture feature ($\mathbf{A}_l$) injects fine detail from reference imagery.

While simple fusion by concatenation and convolution can capture local correlations, it cannot exploit the long-range dependencies vital for matching and blending reference textures. The inclusion of Swin Transformer blocks (windowed self-attention) inside RFA enables modeling of non-local context, facilitating context-aware merging of structure and detail.

The outer residual pathway $\mathbf{Z}_l = \mathbf{U}_l + \mathbf{V}_l$ safeguards against transformer-induced artifacts (e.g., “hallucinated” textures) by always preserving the fused local features. This distinguishes RFA from standard conv-only fusion or ResNet blocks by ensuring both robustness and improved global consistency at each multi-scale step.
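To make the safeguard concrete, the behaviour of the residual sum can be written out for the case where the Swin stack contributes nothing, together with its Jacobian. This is a short derivation from the definitions in Section 2 rather than a result reported by the authors; $\mathbf{I}$ denotes the identity.

```latex
\begin{aligned}
\mathrm{STL}(\mathbf{U}_l) = \mathbf{0} \;&\Rightarrow\; \mathbf{Z}_l = \mathbf{U}_l, \quad \mathbf{F}_{l+1} = \mathrm{Conv}_2(\mathbf{U}_l), \\
\frac{\partial \mathbf{Z}_l}{\partial \mathbf{U}_l} &= \mathbf{I} + \frac{\partial\, \mathrm{STL}(\mathbf{U}_l)}{\partial \mathbf{U}_l}.
\end{aligned}
```

The first line shows that when the Swin stack outputs zero, RFA reduces to conv-only fusion. The identity term in the second line gives the fused features a direct gradient path regardless of how the transformer behaves.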

5. Quantitative Ablations and Empirical Impact

Systematic ablations validate the crucial contribution of RFA, particularly the Swin Transformer sub-block. Table 6 summarizes CUFED5-valid PSNR/SSIM for key variants under identical training.

Variant	PSNR (dB)	SSIM
RDA w/ ordinary feature-warping only	28.25	0.844
RFA w/ plain ResNet blocks	28.50	0.850
Full DATSR (RDA + RFA)	28.72	0.856

Replacing RFA’s Swin Transformer layers with 3×3 convolutional ResNet blocks yields a $0.22$ dB drop in PSNR (28.72 to 28.50). The full pipeline with deformable attention and transformer-based RFA gains $0.47$ dB PSNR over simple feature warping (28.25 to 28.72), establishing RFA as a decisive factor in super-resolution performance. This suggests that windowed self-attention within RFA realizes context-dependent fusion at a level unattainable by convolution-only structures (Cao et al., 2022).
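The gaps between variants follow directly from the table rows. The short script below recomputes them; the numbers are copied from the table, and the variable names are only illustrative.

```python
# PSNR (dB) and SSIM values copied from the ablation table above.
results = {
    "RDA w/ ordinary feature-warping only": (28.25, 0.844),
    "RFA w/ plain ResNet blocks": (28.50, 0.850),
    "Full DATSR (RDA + RFA)": (28.72, 0.856),
}

full_name = "Full DATSR (RDA + RFA)"
full_psnr, full_ssim = results[full_name]

# Gain of the full model over each ablated variant.
gaps = {
    name: (round(full_psnr - psnr, 2), round(full_ssim - ssim, 3))
    for name, (psnr, ssim) in results.items()
    if name != full_name
}

for name, (d_psnr, d_ssim) in gaps.items():
    print(f"Full model vs {name}: +{d_psnr:.2f} dB PSNR, +{d_ssim:.3f} SSIM")
```

The comparison against the ResNet-block variant isolates the Swin Transformer layers inside RFA, while the comparison against feature warping reflects the change in the texture-transfer path.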

6. Integration and Adaptation in Multi-Scale Super-Resolution

RFA is instantiated at each encoder–decoder scale, recursively blending progressively upsampled LR features and reference textures as resolution increases. This multi-scale aggregation adapts texture blending across spatial resolutions, accounting for scale variance in reference–target mismatch and allowing incremental refinement of fine detail. The final output combines the most contextually consistent and visually plausible information across all scales, together with a global skip to guarantee structural alignment with the input.

A plausible implication is that RFA’s architecture would generalize to other multi-stream, multi-scale fusion problems in vision where robustness, context, and preservation of backbone features are all necessary. Its lightweight design (using a combination of small convs and windowed transformers rather than full self-attention) offers a tractable approach for high-resolution tasks where memory and speed are important.
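The cost argument for windowed attention can be made concrete by counting attention score entries. The sketch below compares full self-attention over all $H \times W$ tokens with attention restricted to non-overlapping windows; the feature-map size and window size in the example are arbitrary illustrative values, not settings taken from DATSR.

```python
def attention_pair_counts(height, width, window):
    """Number of query-key score entries for full vs. windowed self-attention.

    Full attention scores every token against every token: (H*W)^2.
    Windowed attention scores each token against its window only: H*W*window^2.
    """
    assert height % window == 0 and width % window == 0
    tokens = height * width
    full = tokens * tokens
    windowed = tokens * window * window
    return full, windowed

# Illustrative 160x160 feature map with 8x8 windows (assumed values).
full, windowed = attention_pair_counts(160, 160, 8)
print(full, windowed, full // windowed)  # 655360000 1638400 400
```

The ratio equals the number of windows, so the saving grows with resolution: windowed attention scales linearly in the number of tokens for a fixed window size, while full attention scales quadratically.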

References

Cao et al. (2022). Reference-based Image Super-Resolution with Deformable Attention Transformer.
