CLAReSNet: Hybrid Deep Network for HSI
- CLAReSNet is a hybrid deep learning model for hyperspectral image classification that combines multi-scale convolution, bidirectional RNNs, and transformer-style latent attention for robust spectral-spatial analysis.
- It employs innovative spatial and spectral modules, including adaptive latent token bottlenecks and hierarchical cross-attention fusion, to reduce computational complexity and enhance accuracy on benchmarks.
- The architecture effectively addresses challenges such as high spectral dimensionality and class imbalance, achieving near-perfect overall accuracy on datasets like Indian Pines and Salinas.
CLAReSNet (Convolutional Latent Attention Residual Spectral Network) is a hybrid deep learning architecture designed for hyperspectral image (HSI) classification, integrating multi-scale convolution, bidirectional recurrent neural networks, and transformer-style latent attention. The network is structured to address the challenges inherent to HSI, including high spectral dimensionality, intricate spectral-spatial dependencies, and pronounced class imbalance. CLAReSNet achieves state-of-the-art performance on benchmark datasets through a series of innovations in its spatial and spectral modeling modules, complexity reduction via adaptive latent attention, and robust hierarchical fusion (Bandyopadhyay et al., 15 Nov 2025).
1. Multi-Scale Convolutional Feature Extraction
CLAReSNet accepts input batches of spectral-spatial patches, each sample covering the full set of spectral bands over a small pixel neighborhood. The initial stem comprises four parallel 2D convolutional layers with varying kernel sizes, each outputting 16 feature maps. The resulting 64-channel feature tensor is aggregated and recalibrated via an enhanced Squeeze-and-Excitation (SE) block employing both global average and max pooling, followed by a two-layer MLP (reduction ratio 16) and channel-wise rescaling, as in the sketch below.
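A minimal PyTorch sketch of this stem, assuming 1×1/3×3/5×5/7×7 kernels; the text specifies four parallel branches of 16 maps each and the avg+max SE pooling, but the exact kernel sizes are an assumption here:

```python
import torch
import torch.nn as nn

class EnhancedSE(nn.Module):
    """Squeeze-and-Excitation using both global average and max pooling."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.GELU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (N, C, H, W)
        avg = x.mean(dim=(2, 3))                          # global average pool -> (N, C)
        mx = x.amax(dim=(2, 3))                           # global max pool     -> (N, C)
        scale = torch.sigmoid(self.mlp(avg) + self.mlp(mx))
        return x * scale[:, :, None, None]                # channel-wise rescaling

class MultiScaleStem(nn.Module):
    """Four parallel 2D convolutions (16 maps each), concatenated to 64 channels."""
    def __init__(self, in_channels: int, kernel_sizes=(1, 3, 5, 7)):  # assumed sizes
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_channels, 16, k, padding=k // 2) for k in kernel_sizes
        )
        self.se = EnhancedSE(64, reduction=16)

    def forward(self, x):
        return self.se(torch.cat([b(x) for b in self.branches], dim=1))
```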
Residual spatial hierarchy is introduced through four dilated convolutional residual blocks with increasing dilation rates. Each block applies batch normalization, GELU nonlinearity, dropout, and another SE recalibration, with additive skip connections to promote stable gradient flow; see the sketch below.
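Continuing the sketch above, one such block might look as follows; the 3×3 kernel and dropout rate are placeholders, and `EnhancedSE` is reused from the stem sketch:

```python
import torch.nn as nn

class DilatedResBlock(nn.Module):
    """One dilated residual block: conv -> BN -> GELU -> dropout -> SE, plus skip."""
    def __init__(self, channels: int, dilation: int, p_drop: float = 0.1):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3,
                              padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.GELU()
        self.drop = nn.Dropout2d(p_drop)
        self.se = EnhancedSE(channels)               # channel recalibration

    def forward(self, x):
        out = self.se(self.drop(self.act(self.bn(self.conv(x)))))
        return x + out                               # additive skip connection
```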
Attention mechanisms at this stage utilize an augmented Convolutional Block Attention Module (CBAM). Channel attention is computed from concatenated pooled features, while spatial attention applies a convolution over concatenated channel-wise statistics (mean, max, std, min), as sketched below. The module concludes with a per-band projection to 256-dimensional embeddings, producing representations with strong localized descriptor properties (Bandyopadhyay et al., 15 Nov 2025).
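A hedged sketch of the augmented spatial-attention branch; the 7×7 kernel size is an assumption, while the four channel-wise statistics follow the text:

```python
import torch
import torch.nn as nn

class AugmentedSpatialAttention(nn.Module):
    """Spatial attention over concatenated mean/max/std/min channel statistics."""
    def __init__(self, kernel_size: int = 7):        # assumed kernel size
        super().__init__()
        self.conv = nn.Conv2d(4, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                            # x: (N, C, H, W)
        stats = torch.cat([
            x.mean(dim=1, keepdim=True),             # per-pixel channel mean
            x.amax(dim=1, keepdim=True),             # ... max
            x.std(dim=1, keepdim=True),              # ... std
            x.amin(dim=1, keepdim=True),             # ... min
        ], dim=1)                                    # (N, 4, H, W)
        return x * torch.sigmoid(self.conv(stats))   # per-location rescaling
```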
2. Deep Spectral Encoder and Latent Attention
The spectral encoder section contains three identical stacked layers operating on sequence representations of the spectral bands. Each input is processed sequentially by a Bi-LSTM and a Bi-GRU, yielding enhanced bidirectional spectral context (see the sketch after this paragraph). The output is then subjected to Multi-Scale Spectral Latent Attention (MSLA), which reduces attention complexity from quadratic in the sequence length to nearly linear-logarithmic by introducing an adaptive latent token bottleneck.
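A minimal sketch of one recurrent stage; the hidden sizes are illustrative, and only the Bi-LSTM→Bi-GRU ordering follows the text:

```python
import torch.nn as nn

class RecurrentSpectralLayer(nn.Module):
    """Bi-LSTM followed by Bi-GRU over the band sequence."""
    def __init__(self, dim: int = 256):              # matches the 256-d embeddings
        super().__init__()
        self.lstm = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.gru = nn.GRU(dim, dim // 2, bidirectional=True, batch_first=True)

    def forward(self, x):                            # x: (N, T, dim), T = #bands
        x, _ = self.lstm(x)                          # both directions concat -> dim
        x, _ = self.gru(x)
        return x
```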
MSLA allocates the number of latent tokens adaptively as a function of the input sequence length, clamping the count between fixed minimum and maximum values so that short band sequences receive a compact bottleneck and longer sequences proportionally more tokens. Three attention scales are formed by progressively downsampling the sequence, supporting hierarchical cross-spectral context aggregation.
The MSLA process is composed of encode (input-to-latent cross-attention), process (latent self-attention with residual feed-forward expansion), and decode (latent-to-output cross-attention) phases, each using multi-head attention. Outputs from all scales are concatenated, fused by a two-layer FFN, and normalized with layer residuals, yielding an informative fused feature tensor (Bandyopadhyay et al., 15 Nov 2025). A single-scale sketch follows.
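A minimal single-scale sketch of the encode–process–decode flow; the head count, latent bounds, and in particular the clamped-logarithmic allocation rule in `n_tokens` are illustrative assumptions, not the paper's exact settings:

```python
import math
import torch
import torch.nn as nn

class LatentAttentionBlock(nn.Module):
    """Single-scale encode-process-decode latent attention (MSLA-style)."""
    def __init__(self, dim: int = 256, heads: int = 4,
                 m_min: int = 8, m_max: int = 64):   # illustrative bounds
        super().__init__()
        self.m_min, self.m_max = m_min, m_max
        self.latent = nn.Parameter(torch.randn(m_max, dim) * 0.02)
        self.enc = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proc = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dec = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm = nn.LayerNorm(dim)

    def n_tokens(self, T: int) -> int:
        # Assumed clamped-logarithmic allocation; the paper's exact rule differs.
        return max(self.m_min, min(self.m_max, 4 * math.ceil(math.log2(T))))

    def forward(self, x):                            # x: (N, T, dim)
        m = self.n_tokens(x.size(1))
        z = self.latent[:m].expand(x.size(0), -1, -1)  # (N, m, dim) latent tokens
        z = self.enc(z, x, x)[0]                     # encode: input -> latent
        z = z + self.proc(z, z, z)[0]                # process: latent self-attention
        z = z + self.ffn(z)                          # residual feed-forward expansion
        out = self.dec(x, z, z)[0]                   # decode: latent -> output
        return self.norm(x + out)                    # cost O(T*m) instead of O(T^2)
```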
3. Hierarchical Cross-Attention Fusion
For classification, CLAReSNet implements a hierarchical aggregation of multi-level representations through cross-attention. Mean-pooling along the sequence dimension of each encoder output $H_\ell$ yields summary vectors $s_\ell = \mathrm{MeanPool}(H_\ell)$ for $\ell = 1, 2, 3$. The final classification feature is $f_{\text{final}} = \mathrm{LayerNorm}(q + \mathrm{CrossAttention}(q, S, S))$, where the query $q = s_3$ is the top-level summary and $S$ is the stack of all three summaries.
Brief pseudocode sketch:
```
S = stack(mean_pool(H[1]), mean_pool(H[2]), mean_pool(H[3]))  # N×3×D
Q = mean_pool(H[3])                                           # N×D
A = CrossAttention(Q, K=S, V=S)                               # N×D
f_final = LayerNorm(Q + A)                                    # N×D
```
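A runnable PyTorch counterpart of the sketch above; the tensor shapes and the single shared attention module are illustrative:

```python
import torch
import torch.nn as nn

N, T, D, heads = 8, 200, 256, 4                      # illustrative shapes
attn = nn.MultiheadAttention(D, heads, batch_first=True)
norm = nn.LayerNorm(D)

H = [torch.randn(N, T, D) for _ in range(3)]         # three encoder outputs
S = torch.stack([h.mean(dim=1) for h in H], dim=1)   # summary stack, (N, 3, D)
Q = H[2].mean(dim=1, keepdim=True)                   # top-level query, (N, 1, D)
A, _ = attn(Q, S, S)                                 # cross-attention over summaries
f_final = norm(Q + A).squeeze(1)                     # final feature, (N, D)
```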
4. Training Methodology and Optimization
CLAReSNet is trained using standard categorical cross-entropy loss. Class imbalance is addressed through stratified sampling and data augmentation (Gaussian noise, random rotations, flips). Optimization uses AdamW with weight decay, with batch sizes of 16 (train) and 32 (validation/test). Dropout rates are set to 0.1 (internal modules) and 0.5/0.25 (classification head), with up to 40 training epochs and early stopping based on validation accuracy.
The latent token allocation hyperparameters (the minimum and maximum token counts and the scaling constants) are set as fixed values. No explicit class-weighting is performed (Bandyopadhyay et al., 15 Nov 2025).
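A minimal training-setup sketch matching this recipe; the learning rate, weight decay, and patience are placeholders (the paper's exact values are not reproduced here), and `model`, `train_loader`, `val_loader`, and `evaluate` are assumed to exist:

```python
import torch

# lr/weight_decay and patience are placeholders, not the paper's values.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = torch.nn.CrossEntropyLoss()              # standard categorical CE

best_acc, patience, bad_epochs = 0.0, 10, 0
for epoch in range(40):                              # up to 40 epochs
    model.train()
    for x, y in train_loader:                        # stratified, augmented batches
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    acc = evaluate(model, val_loader)                # assumed helper: validation accuracy
    if acc > best_acc:
        best_acc, bad_epochs = acc, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                   # early stopping
            break
```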
5. Quantitative Performance and Representational Analysis
Extensive evaluation on the Indian Pines and Salinas hyperspectral benchmarks demonstrates CLAReSNet’s competitive advantage.
| Model | OA (IP, %) | OA (SA, %) | Avg. Inter-class Dist. (IP/SA) |
|---|---|---|---|
| Random Forest | 88.45 | 93.81 | — |
| XGBoost | 92.18 | 95.50 | — |
| HybridSN | 96.53 | 98.22 | 7.29 / 8.10 |
| SSRN | 97.01 | 98.73 | 11.80 / 12.99 |
| SpectralFormer | 73.22 | 89.03 | 19.55 / 18.33 |
| CLAReSNet | 99.71 | 99.96 | 21.25 / 20.98 |
On Indian Pines and Salinas, CLAReSNet attains overall accuracy (OA) of 99.71% and 99.96%, respectively, with Cohen's $\kappa$ values of 0.9967 (IP) and 0.9996 (SA). Precision-recall analysis reveals average precision (AP) above 0.90 for most classes and above 0.80 for minority/hard classes, such as Soybean-clean or Grapes-untrained. Embedding quality is corroborated by t-SNE visualizations, which display tight intra-class clusters and wide inter-class separation, further substantiated by mean inter-class distances of 21.25 (IP) and 20.98 (SA). Classification maps display sharp, spatially coherent predictions with correctly localized uncertainty at class boundaries (Bandyopadhyay et al., 15 Nov 2025).
6. Model Capacity, Scalability, and Prospective Directions
CLAReSNet consists of approximately 17.3 million parameters and is optimized for moderate spatial patch sizes. Scaling to very large or ultra-high-resolution imagery may present computational challenges; further reductions in complexity or parameter count would be necessary for applicability to onboard satellite or other resource-constrained environments.
Anticipated research directions include integration with new-generation PRISMA hyperspectral datasets, enhanced latent compression, and the development of lightweight CLAReSNet variants for real-time deployment and inference on orbital platforms (Bandyopadhyay et al., 15 Nov 2025).