Aerial-Y-Net: Dual-Branch CNN for Aerial Scenes
- Aerial-Y-Net is a dual-branch CNN architecture for aerial scene classification that leverages spatial attention and multi-scale feature fusion.
- It employs parallel ARNet branches with 3×3 and 5×5 convolutions and integrates FuSAM to recalibrate early features for enhanced discrimination.
- Empirical results on the AID dataset show 91.72% accuracy, outperforming baselines such as VGG-16 (89.64%) and GoogLeNet (86.39%).
Aerial-Y-Net is a dual-branch convolutional neural network architecture designed for aerial scene classification, with a focus on spatial attention and multi-scale feature fusion. Developed within the context of benchmarking performance on the AID dataset, Aerial-Y-Net integrates a two-stream backbone, an attention-driven fusion module, and an efficient multi-layer perceptron (MLP) classifier. The architecture has demonstrated improved accuracy over standard convolutional baselines, achieving 91.72% on AID, making it a notable advancement in aerial image analysis (Das et al., 26 Jan 2026).
1. Architectural Overview
Aerial-Y-Net processes an RGB image through two parallel convolutional branches, each termed ARNet. The two ARNet branches are structurally identical except for their kernel sizes, employing $3 \times 3$ and $5 \times 5$ convolutions, respectively. Each branch consists of four convolutional blocks that progressively expand channel depth (64, 128, 256, 512) and reduce spatial dimensions via max pooling, so each branch ends in a 512-channel feature map.
Both early and deep features are fused through distinct mechanisms. After the first conv block, outputs are merged in the FuSAM spatial-attention module. At the deepest layer, feature tensors are concatenated along the channel axis to yield a composite representation. This is globally averaged, producing a $1024$-dimensional latent vector for classification. The classifier MLP comprises two fully connected layers with dropout, followed by a softmax layer (30 classes).
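The channel bookkeeping described above can be traced with a short sketch. It is only an illustration under assumptions the text does not confirm: a hypothetical 224×224 input, "same"-padded convolutions, and one 2×2 stride-2 max pool after every block. The function names are illustrative as well.

```python
# Assumptions (not from the source): "same"-padded convolutions and one
# 2x2 stride-2 max pool after every block. Input size is hypothetical.
BLOCK_CHANNELS = (64, 128, 256, 512)  # channel widths stated in the text


def branch_shapes(height, width, channels=BLOCK_CHANNELS):
    """Return the (C, H, W) shape after each block of one ARNet branch."""
    shapes = []
    for c in channels:
        # The convolution keeps H and W under "same" padding;
        # the pooling step halves both.
        height, width = height // 2, width // 2
        shapes.append((c, height, width))
    return shapes


def fused_vector_length(height, width):
    """Length of the latent vector after concatenation and global pooling."""
    c_small = branch_shapes(height, width)[-1][0]  # 3x3 branch
    c_large = branch_shapes(height, width)[-1][0]  # 5x5 branch, same layout
    # Global average pooling removes H and W, leaving only channels.
    return c_small + c_large
```

Calling `fused_vector_length` with any input size returns 512 + 512 = 1024, matching the latent dimension quoted in the text, because global average pooling makes the vector length independent of spatial resolution.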
2. Mathematical Formulation
Spatial Attention Fusion (FuSAM)
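FuSAM operates on the first-block outputs of the $3 \times 3$ and $5 \times 5$ branches, $F_1^{(3)}$ and $F_1^{(5)}$, which are concatenated along the channel axis into $F_{\text{cat}} = [F_1^{(3)}; F_1^{(5)}]$. A dilated convolution projects $F_{\text{cat}}$ onto an intermediate map $M$, which is passed through a sigmoid activation to produce a spatial attention map $A = \sigma(M)$.
The original feature maps are recalibrated by channel-wise Hadamard product:
$$\tilde{F} = A \odot F,$$
where $F$ is an original first-block feature map and $\odot$ multiplies $A$ element-wise into each channel.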
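A minimal pure-Python sketch of this attention step may help make the data flow concrete. Several details are assumptions rather than facts from the source: the attention map is taken to be single-channel, it is applied to each branch's features separately, and the kernel size (3) and dilation rate (2) are placeholders. Tensors are represented as nested lists of shape C × H × W.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dilated_conv_to_map(features, weights, dilation):
    """Zero-padded dilated convolution from C input channels to one map.

    features: C x H x W nested lists; weights: C x k x k nested lists (k odd).
    """
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    k = len(weights[0])
    half = k // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for c in range(channels):
                for i in range(k):
                    for j in range(k):
                        # Taps are spaced `dilation` pixels apart.
                        yy = y + (i - half) * dilation
                        xx = x + (j - half) * dilation
                        if 0 <= yy < height and 0 <= xx < width:
                            total += weights[c][i][j] * features[c][yy][xx]
            out[y][x] = total
    return out


def fusam(feat_a, feat_b, weights, dilation=2):
    """Concatenate two branch outputs, build a sigmoid attention map, and
    rescale every channel of both inputs by that map.

    The dilation default and the per-branch recalibration are assumptions.
    """
    concatenated = feat_a + feat_b  # channel-axis concatenation
    raw_map = dilated_conv_to_map(concatenated, weights, dilation)
    attention = [[sigmoid(v) for v in row] for row in raw_map]

    def recalibrate(features):
        return [
            [[attention[y][x] * v for x, v in enumerate(row)]
             for y, row in enumerate(channel)]
            for channel in features
        ]

    return recalibrate(feat_a), recalibrate(feat_b), attention
```

With all-zero weights the intermediate map is zero everywhere, the sigmoid yields 0.5 at every position, and both feature maps are simply halved, which is a convenient sanity check for the broadcast logic.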
Multi-Scale Feature Fusion
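The deepest block outputs of the $3 \times 3$ and $5 \times 5$ branches, $F_4^{(3)}$ and $F_4^{(5)}$ (512 channels each), are concatenated along the channel axis:
$$F_{\text{fused}} = [F_4^{(3)}; F_4^{(5)}]$$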
Global average pooling collapses the tensor spatially to a $1024$-dimensional vector, which is then forwarded to the classification module. No additional learnable weighting is applied at this fusion stage.
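The fusion stage is simple enough to sketch directly. The snippet below is an illustration on nested lists of shape C × H × W, with hypothetical function names; it shows that the operation has no parameters, consistent with the remark that no learnable weighting is applied.

```python
def global_average_pool(features):
    """Average each channel of a C x H x W nested list over H and W."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in features
    ]


def fuse_deep_features(feat_small, feat_large):
    """Channel-axis concatenation followed by global average pooling."""
    shape_small = (len(feat_small[0]), len(feat_small[0][0]))
    shape_large = (len(feat_large[0]), len(feat_large[0][0]))
    if shape_small != shape_large:
        raise ValueError("branches must share spatial dimensions")
    # List concatenation stacks the channels of both branches.
    return global_average_pool(feat_small + feat_large)
```

For two 512-channel inputs the result is a list of 1024 per-channel means, in branch order.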
3. Training Regime
Extensive preprocessing and data augmentation are applied to increase model robustness. Images are resized to a fixed input resolution and their pixel values are normalized. Augmentation strategies include horizontal and vertical flips, rotations, brightness and contrast adjustments, RGB channel shifts, and median blur.
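As an illustration of the geometric part of such a pipeline, the sketch below applies random flips to an image stored as a nested list of rows. The probabilities are hypothetical placeholders, since the settings used in the original work are not given here, and the remaining augmentations (rotation, photometric changes, median blur) are omitted.

```python
import random


def horizontal_flip(image):
    """Mirror an H x W (x C) nested-list image left to right."""
    return [row[::-1] for row in image]


def vertical_flip(image):
    """Mirror an H x W (x C) nested-list image top to bottom."""
    return image[::-1]


def random_flips(image, rng, p_horizontal=0.5, p_vertical=0.5):
    """Apply each flip independently with its own probability.

    The default probabilities are placeholders, not values from the source.
    """
    if rng.random() < p_horizontal:
        image = horizontal_flip(image)
    if rng.random() < p_vertical:
        image = vertical_flip(image)
    return image
```

Passing a seeded `random.Random` instance as `rng` keeps the augmentation reproducible across runs.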
Optimization uses categorical cross-entropy loss with the Adam optimizer. The learning rate is annealed from its initial value down to a minimum value by cosine scheduling with warm restarts every $50$ epochs. Training proceeds with batch size $32$ for $200$ epochs. Dropout is employed (30% before the first FC layer, 20% before the second) to mitigate overfitting.
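Cosine annealing with warm restarts follows the standard rule $\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos(\pi\, T_{\text{cur}} / T)\right)$, where $T_{\text{cur}}$ counts epochs since the last restart and $T$ is the restart period. The sketch below evaluates that rule with the 50-epoch period from the text. The function name is illustrative, a fixed period (no period growth between restarts) is assumed, and the learning-rate bounds are left as arguments because their values are not given here.

```python
import math


def cosine_warm_restart_lr(epoch, lr_max, lr_min, period=50):
    """Learning rate at a 0-based epoch under cosine annealing that
    restarts to lr_max every `period` epochs."""
    t_cur = epoch % period  # epochs elapsed since the last restart
    cosine = 0.5 * (1.0 + math.cos(math.pi * t_cur / period))
    return lr_min + (lr_max - lr_min) * cosine
```

Over 200 epochs this produces four identical cycles, each starting at `lr_max` and decaying toward `lr_min` before jumping back at epochs 50, 100, and 150.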
4. Empirical Performance
Aerial-Y-Net achieves an overall test accuracy of 91.72% on the AID dataset, with precision 91.80%, recall 91.72%, and $F_1$-score 91.70%. These results surpass widely adopted CNN baselines evaluated under identical conditions, namely VGG-16 (89.64%), GoogLeNet (86.39%), and CaffeNet (89.53%), an improvement of about 2.1 percentage points over the strongest of them (VGG-16).
| Model | Accuracy (%) | Precision (%) | Recall (%) |
|---|---|---|---|
| Aerial-Y-Net | 91.72 | 91.80 | 91.72 |
| VGG-16 | 89.64 | — | — |
| GoogLeNet | 86.39 | — | — |
| CaffeNet | 89.53 | — | — |
No ablation study quantifying individual contributions of FuSAM or dual-kernel branches was reported.
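The metrics reported in this section can be derived from a confusion matrix, as sketched below. The reported recall equals the reported accuracy (both 91.72%), which is what support-weighted averaging produces by construction. The source does not state its averaging scheme, so weighted averaging is an assumption here, and the function name is illustrative.

```python
def weighted_metrics(confusion):
    """Accuracy and support-weighted precision, recall, and F1.

    confusion[i][j] counts samples of true class i predicted as class j.
    Weighted averaging is an assumption, not confirmed by the source.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    precision = recall = f1 = 0.0
    for i in range(n_classes):
        tp = confusion[i][i]
        support = sum(confusion[i])                   # true count of class i
        predicted = sum(row[i] for row in confusion)  # predicted count
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": correct / total, "precision": precision,
            "recall": recall, "f1": f1}
```

Because each class's recall is weighted by its share of the samples, the weighted recall reduces to correct predictions divided by total samples, which is the accuracy.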
5. Qualitative Evaluation
Accuracy and loss curves over 200 epochs demonstrate smooth convergence with no evidence of significant overfitting. t-SNE visualization of the $1024$-dimensional latent representations reveals class structure, with residual confusion among visually similar aerial categories. ROC curves for all $30$ classes indicate a high area under the curve (AUC) for most classes, though some (notably “School”) are as low as $0.68$.
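A per-class AUC of the kind discussed above can be computed one-vs-rest from the classifier's scores for that class. The sketch below uses the rank interpretation of AUC, the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. It is a quadratic-time illustration with a hypothetical function name, not the evaluation code used in the original work.

```python
def roc_auc(scores, labels):
    """One-vs-rest AUC as the probability that a random positive outranks a
    random negative, counting ties as one half.

    scores: predicted score for the class of interest, one per sample.
    labels: 1 if the sample belongs to that class, else 0.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Under this reading, an AUC of 0.68 means that a randomly chosen image of the class outscores a randomly chosen image of any other class only about two times in three.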
Although explicit heat-maps for the spatial attention module are not presented, FuSAM's design implies that the network accentuates discriminative regions (such as rooftops or roads) in the early stage of feature extraction.
6. Technical Context and Implications
Aerial-Y-Net belongs to a class of attention-augmented, multi-scale CNNs tailored for heterogeneous aerial imagery. Its dual-branch design, in which the smaller kernel captures local detail and the larger kernel captures broader context, reflects a trend in remote sensing towards hybrid architectures for robust feature representation. The spatial attention mechanism addresses scene heterogeneity by recalibrating features according to salient regions.
This suggests that future architectures may further leverage early-stage attention modules and multi-scale fusion for improved discrimination, particularly on diverse, visually complex datasets. The absence of ablation studies, however, leaves open how much of the gain comes from the attention module and how much from the dual-scale convolutions.
7. Comparative Analysis and Future Directions
In terms of benchmarked accuracy, Aerial-Y-Net outperforms the established CNN baselines it was compared against on AID. The combination of aggressive data augmentation and a cosine annealing schedule contributes to robust optimization and generalization, as evidenced by the convergence profiles and classification metrics.
A plausible implication is that architectural extensions, such as alternative attention mechanisms, explicit weighting in feature fusion, or integration of handcrafted features (e.g., SIFT, LBP), may further reduce the observed inter-class confusion. The absence of attention heat-maps and ablation experiments leaves room for subsequent research on interpretability and on separating the contributions of individual components in aerial scene analysis (Das et al., 26 Jan 2026).