
Aerial-Y-Net: Dual-Branch CNN for Aerial Scenes

Updated 2 February 2026
  • Aerial-Y-Net is a dual-branch CNN architecture for aerial scene classification that leverages spatial attention and multi-scale feature fusion.
  • It employs parallel ARNet branches with 3×3 and 5×5 convolutions and integrates FuSAM to recalibrate early features for enhanced discrimination.
  • Empirical results on the AID dataset show a 91.72% accuracy, outperforming models like VGG-16 and GoogLeNet, highlighting its robust performance.

Aerial-Y-Net is a dual-branch convolutional neural network architecture designed for aerial scene classification, with a focus on spatial attention and multi-scale feature fusion. Developed within the context of benchmarking performance on the AID dataset, Aerial-Y-Net integrates a two-stream backbone, an attention-driven fusion module, and an efficient multi-layer perceptron (MLP) classifier. The architecture has demonstrated improved accuracy over standard convolutional baselines, achieving 91.72% on AID, making it a notable advancement in aerial image analysis (Das et al., 26 Jan 2026).

1. Architectural Overview

Aerial-Y-Net processes a $224 \times 224 \times 3$ RGB image input through two parallel convolutional branches, each termed ARNet. ARNet$_1$ and ARNet$_2$ are structurally identical except for their kernel sizes, employing $3 \times 3$ and $5 \times 5$ convolutions, respectively. Each branch follows four convolutional blocks, progressively expanding channel depth (64, 128, 256, 512) and reducing spatial dimensions via $2 \times 2$ max pooling. The branches generate feature maps of size $14 \times 14 \times 512$.

Both early and deep features are fused through distinct mechanisms. After the first conv block, outputs are merged in the FuSAM spatial-attention module. At the deepest layer, feature tensors are concatenated along the channel axis to yield a $14 \times 14 \times 1024$ composite representation. This is globally averaged, producing a $1024$-dimensional latent vector for classification. The classifier MLP comprises two fully connected layers with dropout, followed by a softmax layer (30 classes).
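The shape bookkeeping described above can be checked with a few lines of Python. This is a minimal sketch, not the authors' code: the function name `branch_shapes` is illustrative, and it assumes that convolutions preserve spatial size and that each block ends with a stride-2 $2 \times 2$ max pool.

```python
# Shape bookkeeping for one ARNet branch (a sketch; names are illustrative).
INPUT_SIZE = 224
BLOCK_CHANNELS = (64, 128, 256, 512)  # channel depth after each conv block


def branch_shapes(input_size=INPUT_SIZE, channels=BLOCK_CHANNELS):
    """Return the (height, width, channels) shape after each conv block.

    Assumes "same"-padded convolutions and a 2x2, stride-2 max pool at the
    end of every block, so each block halves height and width.
    """
    shapes = []
    size = input_size
    for depth in channels:
        size //= 2
        shapes.append((size, size, depth))
    return shapes


shapes = branch_shapes()
early = shapes[0]                          # per-branch input to FuSAM
deep = shapes[-1]                          # per-branch deepest feature map
fused = (deep[0], deep[1], 2 * deep[2])    # channel-wise concat of both branches
latent_dim = fused[2]                      # global average pooling keeps channels only

print(shapes)
print(fused, latent_dim)
```

Four halvings of 224 give 14, which is consistent with the $14 \times 14 \times 512$ branch output and the $112 \times 112 \times 64$ tensors that enter FuSAM.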

2. Mathematical Formulation

Spatial Attention Fusion (FuSAM)

FuSAM operates on the outputs of the first convolutional block from both branches, $F_1^1, F_2^1 \in \mathbb{R}^{112 \times 112 \times 64}$, which are concatenated into $F_{12}^1 \in \mathbb{R}^{112 \times 112 \times 128}$. A dilated $3 \times 3$ convolution ($d=2$) projects $F_{12}^1$ onto an intermediate map $A$, which is passed through a sigmoid activation $\sigma$ to produce a spatial attention map $M_s \in \mathbb{R}^{112 \times 112 \times 1}$.

$$A = \text{Conv}_{d=2}(F_{12}^1; W), \qquad M_s = \sigma(A)$$

The original feature maps are recalibrated by an element-wise (Hadamard) product with $M_s$, which is broadcast across all 64 channels of each branch:

$$F_1^{1'} = F_1^1 \odot M_s, \qquad F_2^{1'} = F_2^1 \odot M_s$$
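The FuSAM computation can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the helper names (`dilated_conv3x3`, `fusam`) are hypothetical, zero padding of width $d$ is assumed so that the $112 \times 112$ resolution is preserved, the bias term is omitted, and the kernel $W$ is a random stand-in for learned weights.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def dilated_conv3x3(x, weight, dilation=2):
    """Single-output-channel 3x3 dilated convolution with zero padding.

    x:      (H, W, C) feature map
    weight: (3, 3, C) kernel (cross-correlation convention, no bias)
    Returns an (H, W, 1) map; padding equal to the dilation keeps H and W.
    """
    h, w, _ = x.shape
    pad = dilation
    padded = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            # Kernel tap (i, j) reads the input shifted by dilation-sized steps.
            window = padded[i * dilation:i * dilation + h,
                            j * dilation:j * dilation + w, :]
            out += window @ weight[i, j]
    return out[..., None]


def fusam(f1, f2, weight):
    """Recalibrate both first-block feature maps with a shared attention map."""
    f12 = np.concatenate([f1, f2], axis=-1)        # F_12^1, shape (H, W, 2C)
    m_s = sigmoid(dilated_conv3x3(f12, weight))    # M_s, shape (H, W, 1)
    return f1 * m_s, f2 * m_s, m_s                 # M_s broadcasts over channels


rng = np.random.default_rng(0)
f1 = rng.standard_normal((112, 112, 64))
f2 = rng.standard_normal((112, 112, 64))
w = rng.standard_normal((3, 3, 128)) * 0.01        # stand-in for the learned W
f1_out, f2_out, m_s = fusam(f1, f2, w)
print(f1_out.shape, f2_out.shape, m_s.shape)
```

Because $M_s$ has a single channel, both branches are scaled by the same spatial mask, and every value of the mask lies strictly between 0 and 1.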

Multi-Scale Feature Fusion

The deepest block outputs, $F_1^4, F_2^4 \in \mathbb{R}^{14 \times 14 \times 512}$, are concatenated:

$$F_\text{cat} = \text{Concat}(F_1^4, F_2^4) \in \mathbb{R}^{14 \times 14 \times 1024}$$

Global average pooling collapses the tensor spatially to a $1024$-dimensional vector, which is then forwarded to the classification module. No additional learnable weighting is applied at this fusion stage.
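As a small illustration of this parameter-free fusion, the NumPy sketch below concatenates two random stand-in tensors and applies global average pooling. The variable names are hypothetical and the random inputs merely take the place of real branch outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
f1_deep = rng.standard_normal((14, 14, 512))   # stand-in for F_1^4
f2_deep = rng.standard_normal((14, 14, 512))   # stand-in for F_2^4

# Channel-axis concatenation followed by global average pooling.
f_cat = np.concatenate([f1_deep, f2_deep], axis=-1)   # (14, 14, 1024)
latent = f_cat.mean(axis=(0, 1))                      # (1024,)

print(f_cat.shape, latent.shape)
```

Since no learnable weighting is involved, the first 512 entries of the latent vector are exactly the pooled $3 \times 3$-branch features and the last 512 are the pooled $5 \times 5$-branch features; any mixing between scales happens only in the classifier.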

3. Training Regime

Extensive preprocessing and data augmentation are applied to increase model robustness. Images are resized to $224 \times 224 \times 3$ and normalized to $[0,1]$. Augmentation strategies include horizontal/vertical flips ($p=0.5$), $90^\circ$ rotations ($p=0.5$), brightness/contrast adjustments ($p=0.3$), RGB channel shifts ($p=0.5$), and median blur ($p=0.4$).
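A partial NumPy sketch of such a pipeline is given below. It is an assumption-laden illustration: the source does not name the augmentation library, so the functions `preprocess` and `augment` are hypothetical; division by 255 assumes 8-bit input; each flip is assumed to be drawn independently with $p=0.5$; the RGB shift range of $\pm 0.1$ is invented for the example; and resizing, brightness/contrast adjustment, and median blur are left out for brevity.

```python
import numpy as np


def preprocess(image_uint8):
    """Scale an 8-bit RGB image of shape (224, 224, 3) to floats in [0, 1]."""
    return image_uint8.astype(np.float32) / 255.0


def augment(image, rng):
    """Apply a subset of the listed augmentations, each gated by its probability."""
    if rng.random() < 0.5:      # horizontal flip
        image = image[:, ::-1, :]
    if rng.random() < 0.5:      # vertical flip
        image = image[::-1, :, :]
    if rng.random() < 0.5:      # rotation by a random multiple of 90 degrees
        image = np.rot90(image, k=int(rng.integers(1, 4)), axes=(0, 1))
    if rng.random() < 0.5:      # per-channel RGB shift (range is hypothetical)
        shift = rng.uniform(-0.1, 0.1, size=3).astype(np.float32)
        image = np.clip(image + shift, 0.0, 1.0)
    return np.ascontiguousarray(image)


rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)  # synthetic image
sample = augment(preprocess(raw), rng)
print(sample.shape, sample.dtype)
```

Flips and quarter-turn rotations are natural choices for aerial imagery because overhead scenes have no canonical orientation, so these transforms do not change the class label.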

Optimization utilizes categorical cross-entropy loss with the Adam optimizer. The learning rate is initialized at $1 \times 10^{-3}$ and annealed using cosine scheduling with warm restarts every $50$ epochs (min LR: $1 \times 10^{-6}$). Training proceeds with batch size $32$ for $200$ epochs. Dropout is employed (30% before first FC, 20% before second FC) to mitigate overfitting.
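The learning-rate schedule can be written out explicitly. The sketch below uses the standard cosine-annealing-with-warm-restarts formula and assumes a fixed restart period of 50 epochs with per-epoch updates; the function name `cosine_warm_restart_lr` is illustrative, and the source does not say whether the period grows between restarts.

```python
import math

BASE_LR = 1e-3        # initial learning rate
MIN_LR = 1e-6         # floor of the cosine schedule
RESTART_PERIOD = 50   # epochs between warm restarts
TOTAL_EPOCHS = 200


def cosine_warm_restart_lr(epoch, base_lr=BASE_LR, min_lr=MIN_LR,
                           period=RESTART_PERIOD):
    """Learning rate at a given epoch, restarting every `period` epochs."""
    t_cur = epoch % period  # epochs elapsed since the last restart
    cosine = 1.0 + math.cos(math.pi * t_cur / period)
    return min_lr + 0.5 * (base_lr - min_lr) * cosine


schedule = [cosine_warm_restart_lr(e) for e in range(TOTAL_EPOCHS)]
for epoch in (0, 25, 49, 50, 100, 150):
    print(epoch, f"{schedule[epoch]:.3e}")
```

Under these assumptions the 200 training epochs cover four full cycles, each starting at $1 \times 10^{-3}$ and decaying towards $1 \times 10^{-6}$ before the rate jumps back up.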

4. Empirical Performance

Aerial-Y-Net achieves an overall test accuracy of 91.72% on the AID dataset, with precision 91.80%, recall 91.72%, and $F_1$-score 91.70%. These results surpass widely adopted CNN baselines, such as VGG-16 (89.64%), GoogLeNet (86.39%), and CaffeNet (89.53%) under identical conditions, representing an improvement of approximately 2 percentage points over the strongest of these baselines (VGG-16).

| Model | Accuracy (%) | Precision (%) | Recall (%) |
|---|---|---|---|
| Aerial-Y-Net | 91.72 | 91.80 | 91.72 |
| VGG-16 | 89.64 | - | - |
| GoogLeNet | 86.39 | - | - |
| CaffeNet | 89.53 | - | - |
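The reported recall equals the reported accuracy, which is what support-weighted averaging over classes would produce, since weighted recall is identical to overall accuracy by construction. The source does not state which averaging scheme was used, so the sketch below is only an illustration of that identity; the function `weighted_metrics` and the toy labels are hypothetical.

```python
def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1 over all classes."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n
    precision = recall = f1 = 0.0
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in pairs)
        support = sum(t == c for t in y_true)      # true samples of class c
        predicted = sum(p == c for p in y_pred)    # samples predicted as class c
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / support
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support / n
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return accuracy, precision, recall, f1


# Toy labels for illustration only (not data from the paper).
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
acc, prec, rec, f1 = weighted_metrics(y_true, y_pred)
print(acc, prec, rec, f1)
```

With weights proportional to class support, the per-class recalls sum to the fraction of correctly classified samples, so accuracy and weighted recall always coincide while precision and $F_1$ may differ slightly.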

No ablation study quantifying individual contributions of FuSAM or dual-kernel branches was reported.

5. Qualitative Evaluation

Accuracy and loss curves over 200 epochs demonstrate smooth convergence with no evidence of significant overfitting. t-SNE visualization of the $1024$-dimensional latent representations reveals class structure, with residual confusion among visually similar aerial categories. ROC curves for all $30$ classes indicate that most achieve area under curve (AUC) $>0.90$, though some (notably “School”) are as low as $0.68$.
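For reference, a per-class AUC of the kind quoted above can be computed in a one-vs-rest fashion. The sketch below uses the rank-based (Mann-Whitney) definition of AUC: the fraction of positive/negative pairs in which the positive sample receives the higher score, with ties counted as one half. The function name and the toy scores are hypothetical and are not taken from the paper.

```python
def one_vs_rest_auc(scores, labels, positive_class):
    """AUC for one class against all others, from that class's softmax scores."""
    pos = [s for s, y in zip(scores, labels) if y == positive_class]
    neg = [s for s, y in zip(scores, labels) if y != positive_class]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Count correctly ranked (positive, negative) pairs; ties count one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: predicted probability of "School" for five images.
school_scores = [0.9, 0.4, 0.6, 0.2, 0.7]
true_labels = ["School", "School", "Park", "Park", "River"]
auc = one_vs_rest_auc(school_scores, true_labels, "School")
print(auc)  # 4 of the 6 positive/negative pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking, so a value of $0.68$ for a class indicates that its scores only weakly separate it from the remaining classes.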

Although explicit heat-maps for the spatial attention module are not presented, FuSAM's design implies that the network accentuates discriminative regions (such as rooftops or roads) in the early stage of feature extraction.

6. Technical Context and Implications

Aerial-Y-Net belongs to a class of attention-augmented, multi-scale CNNs tailored for heterogeneous aerial imagery. The dual-branch design with complementary kernel sizes, which capture local and contextual cues, reflects a trend in remote sensing towards hybrid architectures for robust feature representation. The spatial attention mechanism addresses the challenge of heterogeneity by guiding feature recalibration based on salient regions.

This suggests that future architectures may further leverage such early-stage attention modules and multi-scale fusion for improved discrimination, particularly on diverse, visually complex datasets. Because no ablation study was reported, the individual contributions of the attention module and the dual-scale convolution remain an open question.

7. Comparative Analysis and Future Directions

In terms of benchmarked accuracy, Aerial-Y-Net outperforms the established CNN baselines it was compared against on AID. The integration of aggressive data augmentation and a cosine annealing schedule contributes to robust optimization and generalization, as evidenced by convergence profiles and classification metrics.

A plausible implication is that architectural extensions, such as alternative attention mechanisms, explicit weighting in feature fusion, or integration of handcrafted features (e.g., SIFT, LBP), may further reduce the observed inter-class confusion. Absence of heat-map and ablation investigations points to fertile ground for subsequent research on interpretability and architectural disentanglement in aerial scene analysis (Das et al., 26 Jan 2026).
