Scale-Aware Relay Layer (SARL)

Updated 20 November 2025

SARL is a neural architectural module that enhances scale-sensitive feature propagation by adapting quantization in wireless networks and employing cross-scale attention in vision tasks.
In Gaussian relay networks, SARL coarsens quantization based on network size to reduce description costs, achieving near cut-set bound performance with minimal CSI requirements.
For tiny object detection, SARL leverages cross-scale spatial-channel attention to preserve fine details in feature maps, leading to significant AP improvements in aerial imagery benchmarks.

The Scale-Aware Relay Layer (SARL) is a neural architectural module designed to enhance the representation and propagation of scale-sensitive features in deep learning systems. SARL addresses two distinct but conceptually related problems: (1) the efficient relaying of information in layered wireless communication networks, specifically Gaussian relay networks, and (2) the preservation and enhancement of tiny-object detail in deep convolutional networks for object detection, particularly in aerial imagery. In both contexts, SARL explicitly leverages scale information—either via quantization adapted to network size (communication) or via cross-scale spatial-channel attention (vision)—to mitigate losses inherent to naive multi-scale processing.

1. SARL in Layered Gaussian Relay Networks

In the information-theoretic domain, SARL arises from fundamental limitations of compress-and-forward (noisy network coding) schemes in networks where a message is relayed through multiple layers of nodes. Consider a wireless network with $D+1$ layers ( $i=0,\ldots,D$ ), where layer $0$ comprises $K$ single-antenna sources, each relay layer ( $1,\ldots,D-1$ ) contains $K$ single-antenna relays, and the destination (layer $D$ ) has $K$ antennas. Fast Rayleigh fading governs the channel, and additive white Gaussian noise with variance $\sigma^2$ impacts every receiver. The total number of relays is $N=K(D-1)$ (Kolte et al., 2013).

Classically, each relay quantizes its observation by adding independent Gaussian noise of variance $\sigma^2$ (the noise level) and forwards this quantization to the next layer. The achievable sum-rate for this strategy, using the noisy network coding formula,

$R \leq \min_{\Omega: s\in\Omega, d\in\Omega^c} \Bigl[ I(X_{\Omega}; \widehat{Y}_{\Omega^c} \mid X_{\Omega^c}, H) - I(Y_\Omega; \widehat{Y}_\Omega \mid X_N, \widehat{Y}_{\Omega^c}, H) \Bigr]$

suffers an $O(N)$ penalty compared to the cut-set bound, dominated by the cost of describing the many independent relay quantizations. This penalty emerges because each $I(Z;Z+\tilde{Z}) = \log(2)$ , and, in the worst cut, all $N$ relays may be separated (Kolte et al., 2013).

The SARL design rule, revealed by Kolte and Özgür, is to adapt the quantization resolution $Q$ at each relay according to network size: set $Q = (N-1)\sigma^2$ . This coarsens each relay's quantization and reduces the total description cost to a constant independent of $N$ , while incurring only an $O(\log N)$ degradation in cut-set mutual information:

First term: Each cut incurs a noise penalty up to $K\log N$ bits (per MIMO block).
Second term: $I(Y_{\Omega}; \widehat{Y}_{\Omega} \mid \cdots) \leq 1$ bit. Net sum-rate matches the cut-set bound within $O(K\log N)$ uniformly over network size (Kolte et al., 2013). No instantaneous CSI is needed at relays or sources; only the destination requires this for decoding.

2. SARL in Tiny Object Detection

In computer vision, SARL is applied as a cross-scale module placed between the backbone and neck of anchor-based (e.g., YOLOv5) and anchor-free (e.g., YOLOx) detectors to enhance detection of tiny objects in aerial images. Traditional feature pyramid networks (FPN), while effective for generic objects, propagate feature maps (e.g., C $_3$ , C $_4$ , C $_5$ ) through repeated up/downsampling. This process destroys or diffuses the fine-grained spatial information necessary for discriminating small objects—especially those mapped to $8\times8$ or $16\times16$ pixels. SARL relays the most discriminative channel and spatial information from lower to higher layers immediately before strided operations, using attention to select informative content (Li et al., 13 Nov 2025).

3. Architectural Formulation

SARL Block Placement:

SARL is inserted after the convolutional backbone, processing feature map pairs $(F_\ell, F_{\ell+1})$ via sequential “relay blocks.” The processed outputs are then supplied to the neck (e.g., PANet), replacing the unprocessed backbone features (Li et al., 13 Nov 2025).

Channel Attention:

Let $F^{\text{l}} \in \mathbb{R}^{C\times H\times W}$ (finer) and $F^{\text{h}}\in\mathbb{R}^{C\times h\times w}$ (coarser, $h=H/2$ ). Upsample $F^{\text{h}}$ to match resolution:

Concatenate global average pooling from both maps:

$z = [\text{GP}(F^{\text{l}}); \text{GP}(F^{\text{h}}\!\uparrow)] \in \mathbb{R}^{2C\times1\times1}$

Two-layer MLP (with reduction ratio $r$ , e.g., $r=16$ ):

$s = W_2 \delta (W_1 z)$

$A_c = \sigma(s) \in \mathbb{R}^{C\times1\times1}$

$A_c$ gates each channel of $F^{\text{l}}$ .

Spatial Attention:

With channel-selective maps, concatenate:

$U = [A_c \odot F^{\text{l}}; A_c \odot F^{\text{h}}\!\uparrow ] \in \mathbb{R}^{2C\times H\times W}$

Apply $3\times3$ convolution and softmax over spatial positions to obtain $A_s \in \mathbb{R}^{1\times H\times W}$ .

Fusion:

Relay output is given by

$F_{\text{out}} = A_c \odot A_s \odot F^{\text{l}} + F^{\text{h}}\!\uparrow$

where $F^{\text{h}}\!\uparrow$ is the upsampled coarser map (optionally, a $1\times1$ -projected version).

Hyperparameters: Channel reduction ratio $r=16$ ; channel-MLP kernels $1\times1$ ; spatial kernel $3\times3$ ; ReLU activations, sigmoid gating; nearest-neighbor upsampling by $2\times$ ; per-block parameter budget $\approx 0.2$ M for $C=256$ (Li et al., 13 Nov 2025).

4. Integration into Detection Frameworks and Computational Analysis

Anchor-based (YOLOv5): SARL blocks replace raw C $_3$ \to $C$ _4 $and C$ _4 $\to$ C $_5$ edges at the backbone-to-neck interface. No changes required for anchors or label assignment.

Anchor-free (YOLOx): SARL is inserted identically between the backbone and PANet neck. The label assignment (center-based) and IoU-based loss require no adaptation.

Computational Overhead: Each SARL block introduces $\leq 0.2$ M extra parameters (for $C=256$ ) and incurs $2$– $3\%$ additional FLOPs over the neck. Inference latency increases by approximately $2$ ms per $640\times640$ image on an RTX3080Ti. Relative AP gain diminishes with small input resolutions $(<320)$ , as fine resolution is lacking (Li et al., 13 Nov 2025).

5. Empirical Performance and Benchmarking

Extensive ablation studies demonstrate SARL’s efficacy:

YOLOx + SARL alone: On AI-TOD, AP increases from $22.2$ to $24.0$ ( $+1.8$ AP).
SAL (Scale-Adaptive Loss) alone: AP rises to $23.7$ ( $+1.5$ AP).
Combined: $+4.4$ AP over the baseline ($26.6$ AP) (Li et al., 13 Nov 2025).

On VisDrone2019, AP rises $2.2$ points, on DOTA-v2.0 $0.6$ points, and on AI-TOD-v2 $2.6$ points. The cumulative improvement in the YOLOv5 and YOLOx baselines is $5.5\%$ AP (generalization), and $29.0\%$ AP is achieved on the challenging AI-TOD-v2.0 benchmark. These results highlight robust gains on diverse and noisy aerial datasets.

6. Theoretical and Practical Significance

SARL, both in the communication and vision contexts, embodies the principle of scale-adaptation. In Gaussian relay networks, it rigorously demonstrates that naive strategies (quantizing at the noise level) are highly suboptimal in large networks, and that adapting relay operations to total network size collapses an otherwise linear penalty to a logarithmic one (Kolte et al., 2013). In computer vision, SARL directly remedies detail-loss from multi-scale propagation by explicit, attention-driven cross-scale relaying, providing measurable improvements for the detection of tiny objects—an area where traditional FPN/PAN schemes are weakest (Li et al., 13 Nov 2025).

A plausible implication is that similar scale-aware relay principles can generalize to other hierarchical systems, including graph neural networks and spatiotemporal processing pipelines. However, trade-offs include slightly higher memory footprint and reduced effectiveness at low spatial resolutions, indicating the importance of base feature map fidelity to realize the full benefit of SARL (Li et al., 13 Nov 2025).